References & Citations
Statistics > Methodology
Title: Multiple Hypothesis Testing To Estimate The Number of Communities in Sparse Stochastic Block Models
(Submitted on 12 Jan 2022)
Abstract: Network-based clustering methods frequently require the number of communities to be specified \emph{a priori}. Moreover, most of the existing methods for estimating the number of communities assume the number of communities to be fixed and not scale with the network size $n$. The few methods that assume the number of communities to increase with the network size $n$ are only valid when the average degree $d$ of a network grows at least as fast as $O(n)$ (i.e., the dense case) or lies within a narrow range. This presents a challenge in clustering large-scale network data, particularly when the average degree $d$ of a network grows slower than the rate of $O(n)$ (i.e., the sparse case). To address this problem, we proposed a new sequential procedure utilizing multiple hypothesis tests and the spectral properties of Erd\"{o}s R\'{e}nyi graphs for estimating the number of communities in sparse stochastic block models (SBMs). We prove the consistency of our method for sparse SBMs for a broad range of the sparsity parameter. As a consequence, we discover that our method can estimate the number of communities $K^{(n)}_{\star}$ with $K^{(n)}_{\star}$ increasing at the rate as high as $O(n^{(1 - 3\gamma)/(4 - 3\gamma)})$, where $d = O(n^{1 - \gamma})$. Moreover, we show that our method can be adapted as a stopping rule in estimating the number of communities in binary tree stochastic block models. We benchmark the performance of our method against other competing methods on six reference single-cell RNA sequencing datasets. Finally, we demonstrate the usefulness of our method through numerical simulations and by using it for clustering real single-cell RNA-sequencing datasets.
Link back to: arXiv, form interface, contact.