Selective inference for k-means clustering

Yiqun T. Chen; Daniela M. Witten

2023 JMLR JMLR 2023

Selective inference for k-means clustering

Abstract

We consider the problem of testing for a difference in means between clusters of observations identified via k-means clustering. In this setting, classical hypothesis tests lead to an inflated Type I error rate. In recent work, Gao et al. (2022) considered a related problem in the context of hierarchical clustering. Unfortunately, their solution is highly-tailored to the context of hierarchical clustering, and thus cannot be applied in the setting of k-means clustering. In this paper, we propose a p-value that conditions on all of the intermediate clustering assignments in the k-means algorithm. We show that the p-value controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering in finite samples, and can be efficiently computed. We apply our proposal on hand-written digits data and on single-cell RNA-sequencing data. [abs] [ pdf ][ bib ] [ code ] © JMLR 2023. (edit, beta)

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Machine Learning, Mathematics & Optimization, Natural Language Processing

Authors

Yiqun T. Chen , Daniela M. Witten

Topics

Machine Learning > Core Methods > Clustering Machine Learning > Optimization & Theory > Theory Machine Learning > Optimization & Theory > Statistics

Keywords

cluster analysis k-means clustering hypothesis testing type i error selective inference

Download PDF

Related papers

Flexible Model Aggregation for Quantile Regression 2023

Efficient Computation of Rankings from Pairwise Comparisons 2023

Efficient Structure-preserving Support Tensor Train Machine 2023

Attacks against Federated Learning Defense Systems and their Mitigation 2023

How Do You Want Your Greedy: Simultaneous or Repeated? 2023