COLING 2016
Reddit Temporal N-gram Corpus and its Applications on Paraphrase and Semantic Similarity in Social Media using a Topic-based Latent Semantic Analysis
Abstract
This paper introduces a new large-scale n-gram corpus built specifically from social media text. Two characteristics distinguish the corpus: it has a monthly temporal attribute, and it is built from 1.65 billion user-generated comments on Reddit. The usefulness of the corpus is demonstrated and evaluated with a novel Topic-based Latent Semantic Analysis (TLSA) algorithm. Experimental results show that unsupervised TLSA outperforms all state-of-the-art unsupervised and semi-supervised methods on the SemEval-2015 task on paraphrase and semantic similarity in Twitter.
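To make the monthly temporal attribute concrete, here is a minimal sketch of counting n-grams per calendar month. It is not the authors' pipeline: the (timestamp, text) input shape, the lowercase whitespace tokenizer, and the function name `monthly_ngram_counts` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def monthly_ngram_counts(comments, n=2):
    """Count word n-grams per calendar month (hypothetical helper).

    `comments` is an iterable of (unix_timestamp, text) pairs. This input
    shape and the whitespace tokenizer are assumptions, not the paper's method.
    """
    counts = defaultdict(Counter)
    for timestamp, text in comments:
        # Bucket each comment by its UTC year and month, e.g. "2015-01".
        month = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m")
        tokens = text.lower().split()
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[month][tuple(tokens[i:i + n])] += 1
    return counts
```

Keying the counts by month first keeps each month's n-gram table separate, so frequencies for the same n-gram can be compared across months.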
Topics
Machine Learning > Core Methods > Clustering
Natural Language Processing > Applications > Text Classification
Natural Language Processing > Resources & Methods > Lexical Semantics
Machine Learning > Core Methods > Dimensionality Reduction
Machine Learning > Learning Paradigms > Self-Supervised Learning