Robust and Flexible Statistical Methods for Localized Community Detection

M. Weylandt1
Abstract

We consider the problem of localized community detection (LCD): given a large network and a small set of important seed vertices, identify a local community of closely linked vertices, known as the goal block. Unlike traditional community detection, which seeks to identify large-scale structure in the entire graph, LCD only requires identification of local structure relevant to the seed set. To perform this task, we analyze random walks on the graph and nominate vertices with small mean hitting times as members of the goal block. We analyze this problem under a flexible, non-parametric graphon model and establish quite general conditions on the graph structure that allow successful LCD. Notably, our results impose only minimal structure on the graph outside of the goal block, yielding a robust and flexible statistical toolkit for LCD. As part of our analysis, we develop strong concentration bounds for hitting times in random graphs sampled from arbitrary graphons that may be of independent interest.

  • 1

    Department of Information Systems and Statistics, Baruch College, New York, NY USA
    [michael.weylandt@baruch.cuny.edu]

Keywords: Network Analysis – Community Detection – Seed-Set Expansion – Graphon – Non-Parametric

1 Seeded Community Detection

Consider a large, observed graph, 𝒢=(𝒱,). Given a small set of “seed” vertices, Ω0𝒱, localized community detection (LCD) attempts to identify a larger set of vertices, known as the goal block, Ω^Ω0, that form a relevant community. LCD, also sometimes called seed-set expansion because several important algorithms “expand outwards” from the seed set, has found important applications in fields as varied as web search, neuroscience, and sports analytics, and forms the basis of several well-known variants of the PageRank algorithm. [Andersen et al., 2006, Gleich, 2015]

Note that LCD answers a fundamentally different question than traditional Community Detection (CD). In CD, several large-scale communities which divide the entire graph are identified and then analyzed to determine relevance to a particular scientific question. [Abbe, 2018] By contrast, LCD attempts to find only a small group of nodes most relevant to the seed set and makes no inference about the rest of the graph. For example, if CD is applied to the full (multi-lingual) Wikipedia network, the communities identified are language based (e.g., English Wiki, Spanish Wiki, etc.). By contrast, applying LCD with a seed set of recent winners of the COPSS Presidents’ Award, we hope to find a community comprised of the entries of leading statisticians. While it is theoretically possible to identify fine-grained communities by finding an enormous number of communities in CD, computation and estimation difficulties make this infeasible. [Yoder et al., 2020]

2 Analysis of Random Walk Hitting Times via Graphon Concentration

Following the proposal of Whang et al. [2013], we consider a random walk approach to LCD. For each vertex vΩ0, we compute the mean hitting time of v to Ω0: that is, the expected number of steps a random walk on 𝒢 initialized at v takes to arrive at an element of Ω0. As noted by several authors, random walks are more adaptive to graph structure than more conventional metrics such as shortest-path distance. [Whang et al., 2013, Gleich, 2015, Su et al., 2017] In particular, random walks are able to adapt to the high degree skewness observed in real-world graphs and accurately capture the fact that relationships between nodes of different degree are highly asymmetric. As Foss et al. [2020] show, these hitting times can be computed efficiently requiring solving only a single linear system and avoiding the use of Monte Carlo techniques. This allows applications of LCD to extremely large graphs, even with only modest computational resources.

We analyze this approach under a flexible graphon model for the observed graph 𝒢. In particular, we consider 𝒢 as a finite-sample realization of an underlying inhomogeneous Erdös-Rènyi graph of infinite size. Building on prior analysis of Avella-Medina et al. [2020] who considered the graphon-limiting behvaior of the PageRank centrality measure, we first identify a linear integral equation (LIE) whose solution gives the expected hitting times for each vertex in 𝒢. Additionally, we establish concentration bounds on these hitting times and show that, as the graph size grows, the hitting times in any finite realization concentrate rapidly around the graphon limit.

Analyzing this LIE, we establish conditions on the underlying graphon which guarantee that hitting-time based LCD correctly identifies the true goal block with high probability. Notably, our conditions are quite flexible, essentially only requiring that i) a goal block exists; ii) the goal block is well separated from the remainder of the graph; and iii) that non-goal block vertices all share more edges with other non-goal block vertices than with the goal block. Because this analysis does not impose strong conditions on the non-goal portion of the graph, it can be applied broadly and without assuming particular stochastic block model-type structure. Experiments demonstrate that this additional robustness yields significant improvements in several applications of interest.

References

  • Abbe [2018] Emmanuel Abbe. Community detection and stochastic block models. Foundations and Trends in Communications and Information Theory, 14(1-2):1–162, 2018. doi: 10.1561/0100000067.
  • Andersen et al. [2006] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using pagerank vectors. In FOCS 06: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 475–486, 2006. doi: 10.1109/FOCS.2006.44.
  • Avella-Medina et al. [2020] Marco Avella-Medina, Francesca Parise, Michael T. Schaub, and Santiago Segarra. Centrality measures for graphons: Accounting for uncertainty in networks. IEEE Transactions on Network Science and Engineering, 7(1):520–537, 2020. doi: 10.1109/TNSE.2018.2884235.
  • Foss et al. [2020] Alexander H. Foss, Richard B. Lehoucq, W. Zachary Stuart, J. Derek Tucker, and Jonathan W. Berry. A deterministic hitting-time moment approach to seed-set expansion over a graph. ArXiv Pre-Print 2011.09544, 2020. doi: 10.48550/arXiv.2011.09544.
  • Gleich [2015] David F. Gleich. PageRank beyond the web. SIAM Review, 57(3):321–363, 2015. doi: 10.1137/140976649.
  • Su et al. [2017] Yansen Su, Bangju Wang, and Xingyi Zhang. A seed-expanding method based on random walks for community detection in networks with ambiguous community structures. Scientific Reports, 7(41830), 2017. doi: 10.1038/srep41830.
  • Whang et al. [2013] Joyce Jiyoung Whang, David F. Gleich, and Inderjit S. Dhillon. Overlapping community detection using seed set expansion. In CIKM 2013: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, page 2099–2108, 2013. doi: 10.1145/2505515.2505535.
  • Yoder et al. [2020] Jordan Yoder, Li Chen, Henry Pao, Eric Bridgeford, Keith Levin, Donniell E. Fishkind, Carey Priebe, and Vince Lyzinski. Vertex nomination: The canonical sampling and the extended spectral nomination schemes. Computational Statistics & Data Analysis, 145, 2020. doi: 10.1016/j.csda.2020.106916.