2017
EMNLP
EMNLP 2017
Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora
Abstract
AbstractWe introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type of bag-of-words translation feature, and train logistic regression models to classify good data and synthetic noisy data in the proposed feature space. The trained model is used to score parallel sentences in the data pool for selection. As shown in experiments, Zipporah selects a high-quality parallel corpus from a large, mixed quality data pool. In particular, for one noisy dataset, Zipporah achieves a 2.1 BLEU score improvement with using 1/5 of the data over using the entire corpus.
🌉
Interdisciplinary Bridge
— Machine Learning and Natural Language Processing
🧭
Keyword Pioneer
— data cleaning
🐣
Hot Topic Early Bird
— parallel corpus
🐝
Cross-Pollinator
— Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio
Authors
Topics
Machine Learning > Core Methods > Classification
Machine Learning > Application Areas > Data Augmentation
Natural Language Processing > Applications > Machine Translation
Machine Learning > Core Methods > Feature Learning
Natural Language Processing > Generation > Machine Translation
Machine Learning > Learning Types > Classification