Semi-supervised multiple testing

Mary, David; Roquain, Etienne

doi:10.1214/22-EJS2050

Mathematics > Statistics Theory

Title: Semi-supervised multiple testing

Authors: David Mary, Etienne Roquain

(Submitted on 25 Jun 2021 (v1), last revised 24 Nov 2021 (this version, v2))

Abstract: An important limitation of standard multiple testing procedures is that the null distribution must be known. Here, we consider a null distribution-free approach for multiple testing in the following semi-supervised setting: the user does not know the null distribution, but has at hand a sample drawn from this null distribution. In practical situations, this null training sample (NTS) can come from previous experiments, from a part of the data under test, from specific simulations, or from a sampling process. In this work, we present theoretical results that handle such a framework, with a focus on false discovery rate (FDR) control and the Benjamini-Hochberg (BH) procedure. First, we provide upper and lower bounds for the FDR of the BH procedure based on empirical $p$-values. These bounds match when $\alpha (n+1)/m$ is an integer, where $n$ is the NTS sample size and $m$ is the number of tests. Second, we give a power analysis for that procedure suggesting that the price to pay for ignoring the null distribution is low when $n$ is sufficiently large relative to $m$; namely $n\gtrsim m/(\max(1,k))$, where $k$ denotes the number of "detectable" alternatives. Third, to complete the picture, we also present a negative result that exhibits an intrinsic phase transition in the general semi-supervised multiple testing problem and shows that the empirical BH method is optimal in the sense that its performance boundary follows this phase transition. Our theoretical results are supported by numerical experiments, which also show that the delineated boundary is of the correct order without further tuning of any constant. Finally, we demonstrate that our work provides a theoretical grounding for standard practice in astronomical data analysis, and in particular for the procedure proposed in \cite{Origin2020} for galaxy detection.
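To make the procedure named in the abstract concrete, here is a minimal sketch of the BH procedure applied to empirical $p$-values computed from a null training sample. It is an illustration, not the authors' code. It assumes the common convention $p_i = (1 + \#\{j : Y_j \ge X_i\})/(n+1)$, with larger statistics counting as more extreme; the abstract does not spell out the exact definition. The function names `empirical_p_values` and `benjamini_hochberg` are hypothetical.

```python
from bisect import bisect_left


def empirical_p_values(null_sample, test_stats):
    """Empirical p-values from a null training sample (NTS).

    Assumed convention (not stated in the abstract):
    p_i = (1 + #{j : Y_j >= X_i}) / (n + 1), larger statistics = more extreme.
    """
    sorted_null = sorted(null_sample)
    n = len(sorted_null)
    p_values = []
    for x in test_stats:
        # bisect_left counts the null values strictly below x,
        # so n minus that count is the number of null values >= x.
        n_at_least = n - bisect_left(sorted_null, x)
        p_values.append((1 + n_at_least) / (n + 1))
    return p_values


def benjamini_hochberg(p_values, alpha):
    """Return the sorted indices rejected by the BH step-up procedure at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_hat = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: keep the largest rank whose p-value is below alpha * rank / m.
        if p_values[i] <= alpha * rank / m:
            k_hat = rank
    return sorted(order[:k_hat])


# Toy data: n = 19 null draws, m = 5 tests, alpha = 0.25,
# so alpha * (n + 1) / m = 1 is an integer.
null_sample = list(range(1, 20))
test_stats = [25.0, 3.5, 30.0, 10.2, 0.5]
p_values = empirical_p_values(null_sample, test_stats)
rejected = benjamini_hochberg(p_values, alpha=0.25)
```

Under this convention the smallest attainable empirical $p$-value is $1/(n+1)$, so BH can reject $k$ hypotheses only if $1/(n+1) \le \alpha k/m$. This is consistent with the $n \gtrsim m/\max(1,k)$ scaling stated in the abstract.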

Subjects:	Statistics Theory (math.ST); Methodology (stat.ME)
DOI:	10.1214/22-EJS2050
Cite as:	arXiv:2106.13501 [math.ST]
	(or arXiv:2106.13501v2 [math.ST] for this version)

Submission history

From: Etienne Roquain
[v1] Fri, 25 Jun 2021 08:41:02 GMT (1028kb,D)
[v2] Wed, 24 Nov 2021 20:40:08 GMT (1834kb,D)
