Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

Nguyen Xuan Vinh; Julien Epps; James Bailey

JMLR, 2010

Abstract

Information theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information theoretic measures for clustering comparison, including several existing popular measures in the literature, as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information theoretic measures for chance, especially when the data size is small compared to the number of clusters present therein. Of the available information theoretic measures, we advocate the normalized information distance (NID) as a general measure of choice, for it simultaneously possesses several important properties: it is both a metric and a normalized measure, it admits an exact analytical adjusted-for-chance form, and it uses the nominal [0,1] range better than other normalized variants.
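As a rough illustration of the measure the abstract advocates, the sketch below computes NID for two label vectors, assuming the usual definition NID(U, V) = 1 - I(U, V) / max(H(U), H(V)) with entropies and mutual information estimated from label counts. The function names (`entropy`, `mutual_information`, `nid`) and the convention of returning 0 when both clusterings are trivial are assumptions made for this example, not taken from the paper, and the adjusted-for-chance form is not covered.

```python
import math
from collections import Counter


def entropy(labels):
    """Empirical entropy (in nats) of a clustering given as a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Empirical mutual information (in nats) between two clusterings."""
    n = len(u)
    joint = Counter(zip(u, v))   # contingency table as sparse counts
    cu, cv = Counter(u), Counter(v)  # marginal cluster sizes
    return sum(
        (nij / n) * math.log(nij * n / (cu[a] * cv[b]))
        for (a, b), nij in joint.items()
    )


def nid(u, v):
    """Normalized information distance: 1 - I(U,V) / max(H(U), H(V))."""
    if len(u) != len(v):
        raise ValueError("clusterings must label the same items")
    denom = max(entropy(u), entropy(v))
    if denom == 0.0:
        # Assumption: both clusterings have a single cluster, so treat
        # them as identical rather than dividing by zero.
        return 0.0
    return 1.0 - mutual_information(u, v) / denom


# Same partition under different label names versus an independent partition.
same = nid([0, 0, 1, 1], [1, 1, 0, 0])
independent = nid([0, 0, 1, 1], [0, 1, 0, 1])
```

In this toy case the relabelled partition is at distance 0 and the independent partition at distance 1, which are the two ends of the [0,1] range mentioned in the abstract.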

Topics

Machine Learning > Core Methods > Clustering
Machine Learning > Optimization & Theory > Statistical Learning

Keywords

mutual information, normalized information distance, clustering comparison, information theoretic measure, correction for chance
