Efficient Data Representation by Selecting Prototypes with Importance Weights

Gurumoorthy, Karthik S.; Dhurandhar, Amit; Cecchi, Guillermo; Aggarwal, Charu

(Submitted on 5 Jul 2017 (v1), last revised 12 Aug 2019 (this version, v4))

Abstract: Prototypical examples that best summarize and compactly represent an underlying complex data distribution communicate meaningful insights to humans in domains where simple explanations are hard to extract. In this paper we present algorithms with strong theoretical guarantees to mine these data sets and select prototypes (a.k.a. representatives) that optimally describe them. Our work notably generalizes the recent work by Kim et al. (2016) in that, in addition to selecting prototypes, we also associate with them non-negative weights that indicate their importance. This extension provides a single coherent framework under which both prototypes and criticisms (i.e. outliers) can be found. Furthermore, our framework works for any symmetric positive definite kernel, thus addressing one of the key open questions laid out in Kim et al. (2016). By establishing that our objective function enjoys the key property of weak submodularity, we present a fast ProtoDash algorithm and derive approximation guarantees for it. We demonstrate the efficacy of our method on diverse domains such as retail, digit recognition (MNIST), and 40 publicly available health questionnaires obtained from the Center for Disease Control (CDC) website maintained by the US Dept. of Health. We validate the results quantitatively as well as qualitatively, based on expert feedback and recently published scientific studies on public health, thus showcasing the power of our technique in providing actionability (for retail), utility (for MNIST) and insight (on CDC datasets), which arguably are the hallmarks of an effective data mining method.
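
The selection scheme described in the abstract (greedily picking prototypes and attaching non-negative importance weights under a positive definite kernel) can be illustrated with a small sketch. This is not the authors' implementation. The abstract does not spell out the objective, so the sketch assumes a quadratic, MMD-style objective l(w) = w^T mu - (1/2) w^T K w with w >= 0, a Gaussian kernel, and a simple projected-gradient weight fit. The names `gaussian_kernel`, `fit_weights` and `select_prototypes`, the toy data, and all parameter values are hypothetical.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel, one example of a symmetric positive definite kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def fit_weights(K, mu, idx, iters=500):
    """Approximately maximize w.mu - 0.5 * w^T K w over w >= 0 on the chosen
    indices, using projected gradient ascent (an assumed, simple solver)."""
    m = len(idx)
    w = [0.0] * m
    # The trace upper-bounds the largest eigenvalue of a PSD matrix,
    # so 1 / trace is a safe step size.
    step = 1.0 / sum(K[i][i] for i in idx)
    for _ in range(iters):
        grad = [mu[idx[a]] - sum(K[idx[a]][idx[b]] * w[b] for b in range(m))
                for a in range(m)]
        w = [max(0.0, w[a] + step * grad[a]) for a in range(m)]
    return w

def select_prototypes(candidates, target, m, sigma=1.0):
    """Greedy sketch: repeatedly add the candidate with the largest gradient
    of the assumed objective, then refit the non-negative weights."""
    n = len(candidates)
    K = [[gaussian_kernel(a, b, sigma) for b in candidates] for a in candidates]
    # mu[j]: mean kernel similarity between candidate j and the target data.
    mu = [sum(gaussian_kernel(c, t, sigma) for t in target) / len(target)
          for c in candidates]
    chosen, weights = [], []
    for _ in range(min(m, n)):
        def gain(j):
            return mu[j] - sum(K[j][i] * w for i, w in zip(chosen, weights))
        remaining = [j for j in range(n) if j not in chosen]
        best = max(remaining, key=gain)
        if gain(best) <= 0.0:
            break  # no remaining candidate improves the objective
        chosen.append(best)
        weights = fit_weights(K, mu, chosen)
    return chosen, weights

# Toy data (made up): a larger cluster near 0 and a smaller one near 10.
data = [(0.0,), (0.1,), (-0.1,), (0.2,), (-0.2,), (0.05,),
        (10.0,), (10.1,), (9.9,)]
chosen, weights = select_prototypes(data, data, m=2)
print(chosen, weights)
```

On this toy data the sketch picks one prototype from each cluster, and the prototype from the larger cluster receives the larger weight, which is the sense in which the weights indicate importance.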

Comments:	Accepted for publication in International Conference on Data Mining (ICDM) 2019
Subjects:	Machine Learning (stat.ML); Machine Learning (cs.LG)
MSC classes:	65K05, 68W25
Cite as:	arXiv:1707.01212 [stat.ML]
	(or arXiv:1707.01212v4 [stat.ML] for this version)

Submission history

From: Karthik Gurumoorthy
[v1] Wed, 5 Jul 2017 05:17:10 GMT (307kb,D)
[v2] Sat, 14 Oct 2017 15:12:08 GMT (439kb,D)
[v3] Sat, 3 Feb 2018 10:10:45 GMT (569kb,D)
[v4] Mon, 12 Aug 2019 05:35:59 GMT (274kb,D)
