
Title: An Efficient $k$-modes Algorithm for Clustering Categorical Datasets

Abstract: Mining clusters from datasets is an important endeavor in many applications. The $k$-means algorithm is a popular and efficient distribution-free approach for clustering numerical-valued data but cannot be applied to categorical-valued observations. The $k$-modes algorithm addresses this lacuna by taking the $k$-means objective function, replacing the dissimilarity measure, and using modes instead of means in the modified objective function. Unlike many other clustering algorithms, both $k$-modes and $k$-means are scalable, because they do not require calculation of all pairwise dissimilarities. We provide a fast and computationally efficient implementation of $k$-modes, OTQT, and prove that it can find superior clusterings to existing algorithms. We also examine five initialization methods and three types of $K$-selection methods, many of them novel, and all appropriate for $k$-modes. By examining the performance on real and simulated datasets, we show that simple random initialization is the best initializer, that a novel $K$-selection method is more accurate than two methods adapted from $k$-means, and that the new OTQT algorithm is more accurate and almost always faster than existing algorithms.
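
To make the abstract's description of $k$-modes concrete, here is a minimal sketch of the classical alternating scheme: assign each record to its nearest mode, then recompute each cluster's per-coordinate mode. It is not the OTQT algorithm proposed in the paper. The function names (`hamming`, `column_modes`, `k_modes`) are hypothetical, and the Hamming (simple matching) dissimilarity is an assumption, since the abstract does not name the measure. Initialization draws $k$ distinct records at random, so records are assumed to be hashable tuples with at least $k$ distinct values.

```python
import random
from collections import Counter


def hamming(a, b):
    """Count the coordinates at which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category in each coordinate of the given records."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, max_iter=100, seed=0):
    """Alternate assignment and mode updates until labels stop changing.

    Assumption: `data` is a list of equal-length tuples of categories
    containing at least `k` distinct records.
    """
    rng = random.Random(seed)
    # Simple random initialization from the distinct records.
    unique = list(dict.fromkeys(data))
    modes = rng.sample(unique, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: only n * k dissimilarities, never all pairs.
        new_labels = [
            min(range(k), key=lambda j: hamming(x, modes[j])) for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: the mode plays the role the mean plays in k-means.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    # Objective: total dissimilarity of records to their assigned modes.
    cost = sum(hamming(x, modes[lab]) for x, lab in zip(data, labels))
    return labels, modes, cost
```

Calling `k_modes(records, k=2)` returns the labels, the modes, and the value of the objective. Each pass computes one dissimilarity per record and mode, which is the scalability property the abstract refers to.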
Comments: 28 pages, 16 figures, 5 tables
Subjects: Methodology (stat.ME); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)
MSC classes: 62H30, 62Pxx, 62-04, 62-08
ACM classes: I.5.3; G.3
Cite as: arXiv:2006.03936 [stat.ME]
  (or arXiv:2006.03936v1 [stat.ME] for this version)

Submission history

From: Ranjan Maitra
[v1] Sat, 6 Jun 2020 18:41:36 GMT (1863kb,D)
[v2] Sun, 15 Nov 2020 05:32:31 GMT (2289kb,D)
[v3] Wed, 23 Jun 2021 20:18:20 GMT (278kb,D)
