Concentration Bounds for Unigram Language Models

Evgeny Drukh; Yishay Mansour

2005 JMLR JMLR 2005

Concentration Bounds for Unigram Language Models

Abstract

We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis on its error shows a high-probability bound of approximately O(k / m1/2). We improve its dependency on k to O(k1/4 / m1/2 + k / m). We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately O( 1 / k + k1/2 / m). We derive a combined estimator, which has an error of approximately O(m-2/5), for any k. A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show that its error has a high-probability bound of approximately O(1 / m1/2), for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution. [abs] [ pdf ][ bib ] © JMLR 2005. (edit, beta)

📈 Trend Setter — Statistical Learning

🧭 Keyword Pioneer — language model

🐣 Hot Topic Early Bird — language model

🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Speech & Audio

🌱 Topic Pioneer — Language Modeling

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

Authors

Evgeny Drukh , Yishay Mansour

Topics

Machine Learning > Optimization & Theory > Learning Theory Machine Learning > Optimization & Theory > Statistical Learning Natural Language Processing > Generation > Language Modeling

Keywords

generalization error language model concentration bound good-turing estimator unigram language model

Download PDF

Related papers

Diffusion Kernels on Statistical Manifolds 2005

Learning with Decision Lists of Data-Dependent Features 2005

Multiclass Classification with Multi-Prototype Support Vector Machines 2005

Loopy Belief Propagation: Convergence and Effects of Message Errors 2005

Efficient Computation of Gapped Substring Kernels on Large Alphabets 2005