Worst-Case Analysis for Randomly Collected Data

Chen, Justin Y.; Valiant, Gregory; Valiant, Paul

Full-text links:

Download:

Current browse context:

cs.DS

< prev | next >

new | recent | 1911

Computer Science > Data Structures and Algorithms

Title: Worst-Case Analysis for Randomly Collected Data

Authors: Justin Y. Chen, Gregory Valiant, Paul Valiant

(Submitted on 9 Nov 2019 (v1), last revised 26 Oct 2020 (this version, v2))

Abstract: We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements $[n]={1,\ldots,n}$ with corresponding data values $x_1,\ldots,x_n$. We observe the values for a "sample" set $A \subset [n]$ and wish to estimate some statistic of the values for a "target" set $B \subset [n]$ where $B$ could be the entire set. Crucially, we assume that the sets $A$ and $B$ are drawn according to some known distribution $P$ over pairs of subsets of $[n]$. A given estimation algorithm is evaluated based on its "worst-case, expected error" where the expectation is with respect to the distribution $P$ from which the sample $A$ and target sets $B$ are drawn, and the worst-case is with respect to the data values $x_1,\ldots,x_n$. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values--where the weights are functions of the distribution $P$ and the sample and target sets $A$, $B$--and show that the worst-case expected error achieved by this algorithm is at most a multiplicative $\pi/2$ factor worse than the optimal of such algorithms. The algorithm and proof leverage a surprising connection to the Grothendieck problem. This framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process, is a significant departure from typical estimation and introduces a uniform algorithmic analysis for the many natural settings where membership in a sample may be correlated with data values, such as when sampling probabilities vary as in "importance sampling", when individuals are recruited into a sample via a social network as in "snowball sampling", or when samples have chronological structure as in "selective prediction".

Subjects:	Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1911.03605 [cs.DS]
	(or arXiv:1911.03605v2 [cs.DS] for this version)

Submission history

From: Justin Chen [view email]
[v1] Sat, 9 Nov 2019 03:35:14 GMT (24kb)
[v2] Mon, 26 Oct 2020 16:05:01 GMT (83kb,D)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:1911.03605

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Data Structures and Algorithms

Title: Worst-Case Analysis for Randomly Collected Data

Submission history