Title: A Random Sample Partition Data Model for Big Data Analysis

Abstract: Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) data model, which represents a big data set as a set of non-overlapping data subsets, i.e., RSP data blocks, where each RSP data block has the same probability distribution as the whole big data set. Block-based sampling is then used to directly select representative samples for a variety of data analysis tasks. We show how RSP data blocks can be employed to estimate statistics and build models that are equivalent to (or approximations of) those obtained from the whole big data set.
Comments: 10 pages, 9 figures
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
Cite as: arXiv:1712.04146 [cs.DC]
  (or arXiv:1712.04146v1 [cs.DC] for this version)
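
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' partitioning algorithm, which the abstract does not describe. It assumes the whole data set fits in memory and that a uniform random shuffle followed by an even split is an acceptable way to obtain blocks that each resemble a random sample of the whole. The function names `make_rsp_blocks` and `estimate_mean`, the block counts, and the synthetic Gaussian data are all hypothetical and chosen only for illustration.

```python
import random


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records, then split them into non-overlapping blocks.

    Assumption: after a uniform shuffle, every block is a simple random
    sample of the whole data set, so its distribution resembles the whole.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding sends shuffled record i to block i % num_blocks.
    return [shuffled[k::num_blocks] for k in range(num_blocks)]


def estimate_mean(blocks, num_sampled, seed=0):
    """Block-based sampling: pick whole blocks at random and pool them."""
    rng = random.Random(seed)
    chosen = rng.sample(blocks, num_sampled)
    total = sum(sum(block) for block in chosen)
    count = sum(len(block) for block in chosen)
    return total / count


if __name__ == "__main__":
    data_rng = random.Random(42)
    data = [data_rng.gauss(10.0, 2.0) for _ in range(100_000)]
    blocks = make_rsp_blocks(data, num_blocks=100)
    print("full-data mean:    ", sum(data) / len(data))
    print("5-block estimate:  ", estimate_mean(blocks, num_sampled=5))
```

Under these assumptions the estimate from a few blocks should land close to the full-data mean while reading only a small fraction of the records, which is the property the abstract attributes to RSP data blocks.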

Submission history

From: Salman Salloum
[v1] Tue, 12 Dec 2017 06:49:28 GMT (1067kb,D)
[v2] Sat, 20 Jan 2018 10:59:15 GMT (2286kb)