An Efficient $k$-modes Algorithm for Clustering Categorical Datasets

Dorman, Karin S.; Maitra, Ranjan

doi:10.1002/SAM.11546

(Submitted on 6 Jun 2020 (v1), last revised 23 Jun 2021 (this version, v3))

Abstract: Mining clusters from data is an important endeavor in many applications. The $k$-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but does not apply to categorical-valued observations. The $k$-modes method addresses this lacuna by replacing the Euclidean distance with the Hamming distance and the means with the modes in the $k$-means objective function. We provide a novel, computationally efficient implementation of $k$-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing $k$-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for $k$-modes optimization.
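
To make the $k$-modes objective described in the abstract concrete, the following is a minimal Python sketch of a generic alternating (Lloyd-style) iteration: assign each record to its nearest mode under the Hamming distance, then recompute each cluster's coordinate-wise mode. It is not the OTQT algorithm and is not taken from the paper; the function names (`hamming`, `column_modes`, `objective`, `lloyd_k_modes`), the toy data, and the tie-breaking behavior are assumptions made purely for illustration.

```python
from collections import Counter


def hamming(x, y):
    """Number of coordinates at which two equal-length records differ."""
    return sum(a != b for a, b in zip(x, y))


def column_modes(rows):
    """Coordinate-wise mode of a non-empty list of records.

    Ties go to the value seen first (an arbitrary choice for this sketch).
    """
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def objective(data, modes, labels):
    """k-modes cost: total Hamming distance from each record to its mode."""
    return sum(hamming(x, modes[c]) for x, c in zip(data, labels))


def lloyd_k_modes(data, init_modes, max_iter=100):
    """Alternate assignment and mode-update steps until labels stop changing."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode in Hamming distance.
        new_labels = [
            min(range(len(modes)), key=lambda c: hamming(x, modes[c]))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: recompute the mode of every non-empty cluster.
        for c in range(len(modes)):
            members = [x for x, lab in zip(data, labels) if lab == c]
            if members:
                modes[c] = column_modes(members)
    return modes, labels, objective(data, modes, labels)


# Hypothetical toy data: six records with three categorical attributes each.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
modes, labels, cost = lloyd_k_modes(data, init_modes=[data[0], data[3]])
print(modes, labels, cost)
```

On this toy input the iteration stops after the second pass with two clusters of three records each and a total cost of 4. Like any alternating scheme of this kind, the result depends on the initial modes, so in practice several initializations would be compared.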

Comments:	16 pages, 10 figures, 5 tables
Subjects:	Methodology (stat.ME); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)
MSC classes:	62H30, 62Pxx, 62-04, 62-08
ACM classes:	I.5.3; G.3
DOI:	10.1002/SAM.11546
Cite as:	arXiv:2006.03936 [stat.ME]
	(or arXiv:2006.03936v3 [stat.ME] for this version)

Submission history

From: Ranjan Maitra
[v1] Sat, 6 Jun 2020 18:41:36 GMT (1863kb,D)
[v2] Sun, 15 Nov 2020 05:32:31 GMT (2289kb,D)
[v3] Wed, 23 Jun 2021 20:18:20 GMT (278kb,D)
