EMNLP 2020
Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models
Abstract
To scale non-parametric extensions of probabilistic topic models such as latent Dirichlet allocation to larger data sets, practitioners rely increasingly on parallel and distributed systems. In this work, we study data-parallel training for the hierarchical Dirichlet process (HDP) topic model. Based upon a representation of certain conditional distributions within an HDP, we propose a doubly sparse data-parallel sampler for the HDP topic model. This sampler utilizes all available sources of sparsity found in natural language, an important way to make computation efficient. We benchmark our method on a well-known corpus (PubMed) with 8 million documents and 768 million tokens, using a single multi-core machine in under four days.
Topics
Machine Learning > Optimization & Theory > Bayesian Inference
Machine Learning > Optimization & Theory > Distributed Learning
Natural Language Processing > Resources & Methods > Text Representation
Machine Learning > Optimization & Theory > Stochastic Methods
Machine Learning > Bayesian & Probabilistic > Bayesian Inference
Machine Learning > Core Methods > Topic Modeling