
Title: Clustering without Over-Representation

Abstract: In this paper we consider clustering problems in which each point is endowed with a color. The goal is to cluster the points to minimize the classical clustering cost but with the additional constraint that no color is over-represented in any cluster. This problem is motivated by practical clustering settings, e.g., in clustering news articles where the color of an article is its source, it is preferable that no single news source dominates any cluster.
For the most general version of this problem, we obtain an algorithm that has provable guarantees of performance; our algorithm is based on finding a fractional solution using a linear program and rounding the solution subsequently. For the special case of the problem where no color has an absolute majority in any cluster, we obtain a simpler combinatorial algorithm also with provable guarantees. Experiments on real-world data show that our algorithms are effective in finding good clusterings without over-representation.
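
To make the constraint concrete, the following minimal sketch checks whether a given clustering avoids over-representation. It is an illustration only, not the algorithm from the paper: the function names, the threshold parameter `alpha`, and the reading of "over-represented" as "a single color makes up more than an `alpha` fraction of a cluster" are assumptions made here. Under that reading, `alpha = 0.5` corresponds to the special case in which no color has an absolute majority in any cluster.

```python
from collections import Counter
from typing import Hashable, Sequence


def max_color_fraction(colors_in_cluster: Sequence[Hashable]) -> float:
    """Return the largest fraction of a cluster taken up by a single color."""
    if not colors_in_cluster:
        return 0.0
    counts = Counter(colors_in_cluster)
    return max(counts.values()) / len(colors_in_cluster)


def is_not_over_represented(
    assignment: Sequence[int],
    colors: Sequence[Hashable],
    alpha: float,
) -> bool:
    """Check the (assumed) constraint: in every cluster, each color
    accounts for at most an `alpha` fraction of the points.

    assignment[i] is the cluster id of point i; colors[i] is its color.
    """
    clusters: dict[int, list[Hashable]] = {}
    for cluster_id, color in zip(assignment, colors):
        clusters.setdefault(cluster_id, []).append(color)
    return all(
        max_color_fraction(members) <= alpha for members in clusters.values()
    )


# Hypothetical example: six articles from three sources, split into two clusters.
colors = ["a", "a", "b", "c", "b", "c"]
assignment = [0, 0, 0, 0, 1, 1]

ok_at_half = is_not_over_represented(assignment, colors, alpha=0.5)
ok_at_forty_percent = is_not_over_represented(assignment, colors, alpha=0.4)
```

In this example source `a` fills exactly half of cluster 0, so the clustering passes the check at `alpha = 0.5` but fails it at `alpha = 0.4`. The check only tests feasibility; it says nothing about the clustering cost that the algorithms in the paper minimize.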
Comments: 10 pages, 6 figures, in KDD 2019
Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
ACM classes: I.5.3; G.1.6; H.2.8; F.2.2
Journal reference: in Proceedings of The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2019
DOI: 10.1145/3292500.3330987
Cite as: arXiv:1905.12753 [cs.DS]
  (or arXiv:1905.12753v1 [cs.DS] for this version)

Submission history

From: Alessandro Epasto
[v1] Wed, 29 May 2019 22:21:47 GMT (115kb,D)