Document Similarity for Texts of Varying Lengths via Hidden Topics

Hongyu Gong; Tarek Sakakini; Suma Bhat; Jinjun Xiong

ACL 2018

Abstract

Measuring similarity between texts is an important task for several applications. Available approaches to measuring document similarity are inadequate for document pairs of non-comparable lengths, such as a long document and its summary. This is because of the lexical, contextual, and abstraction gaps between a long document rich in detail and its concise, abstract summary. In this paper, we present a document matching approach that bridges these gaps by comparing the texts in a common space of hidden topics. We evaluate the matching algorithm on two matching tasks and find that it consistently and widely outperforms strong baselines. We also highlight the benefits of incorporating domain knowledge into text matching.
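To make "comparing the texts in a common space of hidden topics" concrete, here is a minimal sketch of one possible reading. It is not taken from the paper. It assumes that each word is already represented by an embedding vector, that hidden topics can be approximated by the top singular directions of the long document's word vectors, and that relevance can be scored by how well those topics reconstruct the summary's words. The function names `hidden_topics` and `relevance` and the toy 3-dimensional vectors are hypothetical.

```python
import numpy as np


def hidden_topics(doc_vectors, k):
    """Return k orthonormal topic directions (as rows) for a document.

    Assumption: topics are approximated by the top-k right singular
    vectors of the matrix of word vectors (one word per row).
    """
    _, _, vt = np.linalg.svd(doc_vectors, full_matrices=False)
    return vt[:k]


def relevance(summary_vectors, topics):
    """Mean cosine similarity between each summary word vector and its
    projection onto the span of the topic vectors."""
    # Orthogonal projection of every summary word onto the topic subspace.
    projections = summary_vectors @ topics.T @ topics
    num = np.sum(summary_vectors * projections, axis=1)
    den = (np.linalg.norm(summary_vectors, axis=1)
           * np.linalg.norm(projections, axis=1))
    # A word with a (numerically) zero projection shares nothing with the
    # topics, so it is scored 0 instead of dividing by zero.
    cos = np.divide(num, den, out=np.zeros_like(num), where=den > 1e-12)
    return float(cos.mean())


# Toy data: the long document's word vectors all lie in the x-y plane.
doc = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [1.0, 1.0, 0.0],
                [2.0, 1.0, 0.0]])
topics = hidden_topics(doc, k=2)

on_topic = np.array([[1.0, 2.0, 0.0], [3.0, 1.0, 0.0]])  # inside the plane
off_topic = np.array([[0.0, 0.0, 1.0]])                   # orthogonal to it
mixed = np.array([[1.0, 0.0, 1.0]])                       # halfway between

score_on = relevance(on_topic, topics)
score_off = relevance(off_topic, topics)
score_mixed = relevance(mixed, topics)
```

In this toy setup the summary whose words lie in the document's topic plane scores close to 1, the orthogonal one scores close to 0, and the mixed one falls in between. Because the score is averaged per summary word rather than computed over the full word counts of both texts, it does not directly penalize the length difference between the two.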

Topics

Machine Learning > Core Methods > Clustering
Machine Learning > Core Methods > Representation Learning
Machine Learning > Core Methods > Embedding Learning
Machine Learning > Learning Types > Unsupervised Learning
Natural Language Processing > Applications > Information Retrieval
Data Science & Analytics > Applications > Clustering
Machine Learning > Learning Types > Representation Learning

Keywords

representation learning; topic modeling; domain knowledge; text matching; topic model; document matching; document similarity; hidden topic model; domain knowledge incorporation; hidden topic; abstraction gap; hidden topics
