A Pseudo Label based Dataless Naive Bayes Algorithm for Text Classification with Seed Words

Ximing Li; Bo Yang

COLING 2018

Abstract

Traditional supervised text classifiers require a large number of manually labeled documents, which are often expensive to obtain. Dataless text classification has recently attracted growing attention because it requires only a few seed words per category, which are much cheaper to collect. In this paper, we develop a pseudo-label based dataless Naive Bayes (PL-DNB) classifier with seed words. We initialize a pseudo-label for each document from its seed word occurrences, and employ the expectation maximization algorithm to train PL-DNB in a semi-supervised manner. The pseudo-labels are iteratively updated using a mixture of seed word occurrences and estimates of the label posteriors. To avoid noisy pseudo-labels, the pseudo-label update step also takes into account the nearest neighboring documents, i.e., it preserves the local neighborhood structure of the documents. We empirically show that PL-DNB outperforms traditional dataless text classification algorithms with seed words. In particular, PL-DNB performs well on the imbalanced dataset.
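
The training loop outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the mixing weights `lam` and `mu`, the cosine-similarity neighbor search, the Laplace smoothing constant `alpha`, and all function names are assumptions chosen only to make the idea concrete. `X` is assumed to be a dense document-by-word count matrix, and `seed_idx` a list of seed word indices per category.

```python
import numpy as np

def init_pseudo_labels(X, seed_idx):
    # Pseudo-label = normalized seed word counts per category (assumed form).
    counts = np.stack([X[:, idx].sum(axis=1) for idx in seed_idx], axis=1)
    totals = counts.sum(axis=1, keepdims=True)
    uniform = np.full_like(counts, 1.0 / len(seed_idx))
    # Documents containing no seed word fall back to a uniform pseudo-label.
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), uniform)

def m_step(X, Q, alpha=1.0):
    # Multinomial Naive Bayes parameters from soft labels Q, Laplace-smoothed.
    prior = Q.sum(axis=0) / Q.sum()
    word_counts = Q.T @ X + alpha
    word_prob = word_counts / word_counts.sum(axis=1, keepdims=True)
    return np.log(prior + 1e-12), np.log(word_prob)

def e_step(X, log_prior, log_word_prob):
    # Label posteriors under the current parameters, computed in log space.
    log_post = X @ log_word_prob.T + log_prior
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def knn_smooth(X, Q, k=2):
    # Average the labels of the k most cosine-similar documents (assumed metric).
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)  # a document is not its own neighbor
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    return Q[nbrs].mean(axis=1)

def train_pl_dnb_sketch(X, seed_idx, n_iter=10, lam=0.5, mu=0.5, k=2):
    X = np.asarray(X, dtype=float)
    S = init_pseudo_labels(X, seed_idx)  # seed word evidence, kept fixed
    Q = S.copy()                         # current pseudo-labels
    for _ in range(n_iter):
        log_prior, log_word_prob = m_step(X, Q)
        P = e_step(X, log_prior, log_word_prob)
        mixed = lam * S + (1.0 - lam) * P  # seed evidence mixed with posteriors
        Q = mu * mixed + (1.0 - mu) * knn_smooth(X, mixed, k)
    return Q.argmax(axis=1), Q

# Toy corpus: words 0-1 belong to category 0, words 2-3 to category 1.
X_toy = np.array([[3, 1, 0, 0], [0, 4, 0, 0], [0, 0, 2, 2],
                  [0, 0, 0, 5], [2, 2, 0, 0], [0, 0, 3, 1]])
labels, Q = train_pl_dnb_sketch(X_toy, seed_idx=[[0], [2]])
```

In the toy corpus the second and fourth documents contain no seed word, so they start with uniform pseudo-labels and can only be assigned a category through the Naive Bayes posteriors and their neighbors. This is the situation the neighborhood term in the abstract is meant to handle.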

Topics

Machine Learning > Learning Types > Semi-Supervised Learning
Natural Language Processing > Applications > Text Classification

Keywords

semi-supervised learning, text classification, expectation maximization, pseudo label, naive Bayes
