A Bayesian non-parametric method for clustering high-dimensional binary data

Santra, Tapesh

Authors: Tapesh Santra

(Submitted on 8 Mar 2016)

Abstract: In many real-life problems, objects are described by a large number of binary features. For instance, documents are characterized by the presence or absence of certain keywords, and cancer patients are characterized by the presence or absence of certain mutations. In such cases, grouping together similar objects or profiles based on these high-dimensional binary features is desirable but challenging. Here, I present a Bayesian non-parametric algorithm for clustering high-dimensional binary data. It uses a Dirichlet Process (DP) mixture model and simulated annealing not only to cluster binary data but also to find the optimal number of clusters in the data. The performance of the algorithm was evaluated and compared with other algorithms using simulated datasets. It outperformed all other clustering methods that were tested in the simulation studies. It was also used to cluster real datasets arising from document analysis, handwritten image analysis and cancer research. It successfully divided a set of documents based on their topics and handwritten images based on different styles of writing digits, and it identified tissue and mutation specificity of chemotherapy treatments.

Subjects:	Applications (stat.AP); Machine Learning (cs.LG); Machine Learning (stat.ML)
Cite as:	arXiv:1603.02494 [stat.AP]
	(or arXiv:1603.02494v1 [stat.AP] for this version)

Submission history

From: Tapesh Santra
[v1] Tue, 8 Mar 2016 12:02:59 GMT (1383kb)
