Testing Closeness With Unequal Sized Samples

Bhaswar Bhattacharya; Gregory Valiant

2015 NIPS NeurIPS 2015

Testing Closeness With Unequal Sized Samples

Abstract

We consider the problem of testing whether two unequal-sized samples were drawn from identical distributions, versus distributions that differ significantly. Specifically, given a target error parameter $\eps > 0$, $m_1$ independent draws from an unknown distribution $p$ with discrete support, and $m_2$ draws from an unknown distribution $q$ of discrete support, we describe a test for distinguishing the case that $p=q$ from the case that $||p-q||_1 \geq \eps$. If $p$ and $q$ are supported on at most $n$ elements, then our test is successful with high probability provided $m_1\geq n^{2/3}/\varepsilon^{4/3}$ and $m_2 = \Omega\left(\max\{\frac{n}{\sqrt m_1\varepsilon^2}, \frac{\sqrt n}{\varepsilon^2}\}\right).$ We show that this tradeoff is information theoretically optimal throughout this range, in the dependencies on all parameters, $n,m_1,$ and $\eps$, to constant factors. As a consequence, we obtain an algorithm for estimating the mixing time of a Markov chain on $n$ states up to a $\log n$ factor that uses $\tilde{O}(n^{3/2} \tau_{mix})$ queries to a ``next node'' oracle. The core of our testing algorithm is a relatively simple statistic that seems to perform well in practice, both on synthetic data and on natural language data. We believe that this statistic might prove to be a useful primitive within larger machine learning and natural language processing systems.

🌉 Interdisciplinary Bridge — Machine Learning and Mathematics & Optimization

🧭 Keyword Pioneer — identity testing

🐣 Hot Topic Early Bird — statistical learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Bhaswar Bhattacharya , Gregory Valiant

Topics

Machine Learning > Optimization & Theory > Learning Theory Mathematics & Optimization > Mathematics > Information Theory Mathematics & Optimization > Mathematics > Probability Mathematics & Optimization > Mathematics > Statistics Machine Learning > Optimization & Theory > Statistics Mathematics & Optimization > Probability > Stochastic Processes

Keywords

statistical learning learning theory sample complexity hypothesis testing distribution testing l1 distance identity testing markov chain mixing time

Download PDF

Related papers

Data Generation as Sequential Decision Making 2015

A Recurrent Latent Variable Model for Sequential Data 2015

Combinatorial Cascading Bandits 2015

Accelerated Mirror Descent in Continuous and Discrete Time 2015

Matrix Completion with Noisy Side Information 2015