Computer Science > Distributed, Parallel, and Cluster Computing
Title: A Random Sample Partition Data Model for Big Data Analysis
(Submitted on 12 Dec 2017 (this version), latest version 20 Jan 2018 (v2))
Abstract: Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) to represent a big data set as a set of non-overlapping data subsets, i.e., RSP data blocks, where each RSP data block has the same probability distribution as the whole big data set. Block-based sampling is then used to directly select representative samples for a variety of data analysis tasks. We show how RSP data blocks can be employed to estimate statistics and build models that are equivalent (or approximately equivalent) to those from the whole big data set.
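The idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes the data fits in memory and uses a plain shuffle-and-split to produce the blocks, and the names `random_sample_partition` and `block_sample_mean` are hypothetical, introduced here only for illustration. Under those assumptions, each block is a random sample of the whole data set, so a statistic computed from a few randomly chosen blocks should approximate the full-data value.

```python
import random
import statistics


def random_sample_partition(records, num_blocks, seed=0):
    """Shuffle the records and deal them into num_blocks disjoint blocks.

    Illustrative stand-in for an RSP: every record lands in exactly one
    block, and each block is a random sample of the input.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding keeps block sizes within one record of each other.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def block_sample_mean(blocks, num_selected, seed=0):
    """Estimate the mean from a random selection of whole blocks."""
    rng = random.Random(seed)
    chosen = rng.sample(blocks, num_selected)
    return statistics.fmean(x for block in chosen for x in block)


if __name__ == "__main__":
    # Synthetic data, assumed for the demo only.
    data_rng = random.Random(42)
    data = [data_rng.gauss(10.0, 2.0) for _ in range(100_000)]

    blocks = random_sample_partition(data, num_blocks=100)
    print("full-data mean:  ", statistics.fmean(data))
    print("5-block estimate:", block_sample_mean(blocks, num_selected=5))
```

Selecting whole blocks rather than individual records is the point of the design: once the partition exists, a representative sample can be obtained by reading a handful of blocks instead of scanning the full data set.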
Submission history
From: Salman Salloum
[v1] Tue, 12 Dec 2017 06:49:28 GMT (1067kb,D)
[v2] Sat, 20 Jan 2018 10:59:15 GMT (2286kb)