Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

Chaitanya Chemudugunta; Padhraic Smyth; Mark Steyvers

NIPS (NeurIPS) 2006

Abstract

Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or semantic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the specific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over general topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-semantic indexing) or that only match documents at the specific word level (such as TF-IDF).

1 Introduction and Motivation

Reducing high-dimensional data vectors to robust and interpretable lower-dimensional representations has a long and successful history in data analysis, including recent innovations such as latent semantic indexing (LSI) (Deerwester et al., 1994) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). These types of techniques have found broad application in modeling of sparse high-dimensional count data such as the "bag of words" representations for documents or transaction data for Web and retail applications. Approaches such as LSI and LDA have both been shown to be useful for "object matching" in their respective latent spaces. In information retrieval for example, both a query and a set of documents can be represented in the LSI or topic latent spaces, and the docum
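
The three-part document representation described in the abstract can be illustrated with a small numerical sketch. The abstract does not give the parameterization, so everything below is an assumption made for illustration: the per-document mixing weights `lam`, the array names `theta`, `phi`, `psi_doc`, and `omega`, and the use of a query log-likelihood as the matching score. It is not the authors' inference procedure, only the word probability that such a mixture would imply once its parameters are known.

```python
import numpy as np

def word_probabilities(lam, theta, phi, psi_doc, omega):
    """Word distribution for one document under an assumed three-way mixture.

    lam     : (3,)   weights over [general topics, document-specific, background]
    theta   : (T,)   the document's mixture over T general topics
    phi     : (T, V) word distribution of each topic
    psi_doc : (V,)   distribution over words specific to this document
    omega   : (V,)   background distribution over common words
    """
    topic_part = theta @ phi  # (V,) words as explained by general topics
    return lam[0] * topic_part + lam[1] * psi_doc + lam[2] * omega

def query_log_likelihood(query_word_ids, p_w):
    """Log-probability of a bag-of-words query under a document's word distribution."""
    return float(np.sum(np.log(p_w[np.asarray(query_word_ids)])))

# Toy example: 4-word vocabulary, 2 topics (all numbers are made up).
lam = np.array([0.5, 0.3, 0.2])
theta = np.array([0.5, 0.5])
phi = np.array([[0.7, 0.1, 0.1, 0.1],
                [0.1, 0.7, 0.1, 0.1]])
psi_doc = np.array([0.0, 0.0, 1.0, 0.0])  # word 2 is specific to this document
omega = np.full(4, 0.25)                  # uniform background

p_w = word_probabilities(lam, theta, phi, psi_doc, omega)
score = query_log_likelihood([2, 0], p_w)
```

In this toy setting, word 2 receives extra mass from the document-specific component even though neither topic favors it, which is the kind of specific-word matching the abstract contrasts with purely topic-level matching.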

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Unsupervised Learning
Natural Language Processing > Resources & Methods > Text Representation
Machine Learning > Bayesian & Probabilistic > Probabilistic Modeling
Machine Learning > Core Methods > Probabilistic Modeling
Natural Language Processing > Resources & Methods > Language Modeling
Data Science & Analytics > Applications > Information Retrieval
Natural Language Processing > Applications > Topic Modeling

Keywords

unsupervised learning; latent Dirichlet allocation; topic modeling; information retrieval; document representation; latent semantic analysis; document modeling; semantic indexing; latent semantic indexing; probabilistic topic model
