Distributional Word Clusters vs. Words for Text Categorization

Ron Bekkerman; Ran El-Yaniv; Naftali Tishby; Yoad Winter

JMLR, 2003

Abstract

We study an approach to text categorization that combines distributional clustering of words and a Support Vector Machine (SVM) classifier. This word-cluster representation is computed using the recently introduced Information Bottleneck method, which generates a compact and efficient representation of documents. When combined with the classification power of the SVM, this method yields high performance in text categorization. This novel combination of SVM with word-cluster representation is compared with SVM-based categorization using the simpler bag-of-words (BOW) representation. The comparison is performed over three known datasets. On one of these datasets (the 20 Newsgroups) the method based on word clusters significantly outperforms the word-based representation in terms of categorization accuracy or representation efficiency. On the two other sets (Reuters-21578 and WebKB) the word-based representation slightly outperforms the word-cluster representation. We investigate the potential reasons for this behavior and relate it to structural differences between the datasets.
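To make the word-cluster representation concrete, here is a minimal sketch of the idea, not the authors' implementation. It assumes a greedy agglomerative procedure that merges the two word clusters with the lowest weighted Jensen-Shannon cost between their class distributions p(c|w), a criterion used in agglomerative variants of the Information Bottleneck. The abstract does not say which variant the paper uses. The toy corpus and all function names (`word_class_stats`, `merge_cost`, `cluster_words`, `to_cluster_counts`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def word_class_stats(docs, labels):
    """Estimate p(w) and p(c|w) from token counts in labeled documents."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)  # word -> class -> count
    for tokens, label in zip(docs, labels):
        for tok in tokens:
            counts[tok][label] += 1
    total = sum(sum(c.values()) for c in counts.values())
    p_w = {w: sum(c.values()) / total for w, c in counts.items()}
    p_c_given_w = {
        w: [c[k] / sum(c.values()) for k in classes] for w, c in counts.items()
    }
    return p_w, p_c_given_w

def kl(p, q):
    """Kullback-Leibler divergence; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def merge_cost(pa, da, pb, db):
    """Weighted Jensen-Shannon cost of merging two clusters (assumed criterion)."""
    total = pa + pb
    wa, wb = pa / total, pb / total
    mix = [wa * x + wb * y for x, y in zip(da, db)]
    return total * (wa * kl(da, mix) + wb * kl(db, mix))

def cluster_words(p_w, p_c_given_w, n_clusters):
    """Greedily merge the cheapest pair until n_clusters remain."""
    # Each cluster is (set of words, prior mass, class distribution).
    clusters = [({w}, p_w[w], p_c_given_w[w]) for w in sorted(p_w)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i][1], clusters[i][2],
                                  clusters[j][1], clusters[j][2])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (wi, pi, di), (wj, pj, dj) = clusters[i], clusters[j]
        total = pi + pj
        merged = (wi | wj, total,
                  [(pi * x + pj * y) / total for x, y in zip(di, dj)])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return {w: idx for idx, (words, _, _) in enumerate(clusters) for w in words}

def to_cluster_counts(tokens, word_to_cluster, n_clusters):
    """Represent a document by how many of its tokens fall in each cluster."""
    vec = [0] * n_clusters
    for tok in tokens:
        if tok in word_to_cluster:  # words unseen in training are skipped
            vec[word_to_cluster[tok]] += 1
    return vec

# Toy labeled corpus (hypothetical data, for illustration only).
docs = [["goal", "match", "team"], ["team", "score", "goal"],
        ["cpu", "disk", "memory"], ["memory", "cpu", "kernel"]]
labels = ["sport", "sport", "tech", "tech"]

p_w, p_c_given_w = word_class_stats(docs, labels)
word_to_cluster = cluster_words(p_w, p_c_given_w, n_clusters=2)
features = to_cluster_counts(["goal", "cpu", "cpu", "unknown"], word_to_cluster, 2)
```

In a full pipeline, the cluster-count vectors would take the place of bag-of-words vectors as input to an SVM classifier; that step is omitted here. The pairwise search makes each merge quadratic in the number of clusters, so this sketch only suits small vocabularies.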

Topics

Machine Learning > Core Methods > Classification

Machine Learning > Core Methods > Clustering

Natural Language Processing > Applications > Text Classification

Keywords

information bottleneck, text categorization, support vector machine, distributional clustering, word clustering, word cluster

Related papers

Bottom-Up Relational Learning of Pattern Matching Rules for Information Extraction 2003

An Efficient Boosting Algorithm for Combining Preferences 2003

A Multiscale Framework For Blind Separation of Linearly Mixed Signals 2003

Word-Sequence Kernels 2003

An Extensive Empirical Study of Feature Selection Metrics for Text Classification 2003