Efficient Extraction of Pseudo-Parallel Sentences from Raw Monolingual Data Using Word Embeddings

Benjamin Marie; Atsushi Fujita

ACL 2017

Abstract

We propose a new method for extracting pseudo-parallel sentences from a pair of large monolingual corpora, without relying on any document-level information. Our method first exploits word embeddings to efficiently evaluate trillions of candidate sentence pairs, and then uses a classifier to find the most reliable ones. We report significant improvements in domain adaptation for statistical machine translation when using a translation model trained on the sentence pairs extracted from in-domain monolingual corpora.
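
The abstract does not say how the word embeddings are used to score candidate pairs, so the following is only a minimal sketch of one common approach, not the authors' method. It assumes that words of both languages are embedded in a single shared space, that a sentence is represented by the average of its word vectors, and that candidates are ranked by cosine similarity. The names `sentence_vector`, `cosine`, `top_candidates`, and `TOY_EMBEDDINGS` are hypothetical, the vectors are made up for illustration, and the classifier stage is omitted.

```python
import math

# Toy bilingual embeddings in one shared space (made-up values, assumption).
TOY_EMBEDDINGS = {
    "cat": [1.0, 0.0],
    "chat": [0.9, 0.1],
    "dog": [0.0, 1.0],
    "chien": [0.1, 0.9],
}


def sentence_vector(tokens, embeddings):
    """Average the embeddings of known tokens; return None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_candidates(src_sentences, tgt_sentences, embeddings, k=1):
    """Map each source sentence to its k most similar target sentences."""
    tgt_vectors = [(t, sentence_vector(t.split(), embeddings))
                   for t in tgt_sentences]
    results = {}
    for s in src_sentences:
        s_vec = sentence_vector(s.split(), embeddings)
        if s_vec is None:
            results[s] = []
            continue
        scored = [(cosine(s_vec, t_vec), t)
                  for t, t_vec in tgt_vectors if t_vec is not None]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        results[s] = scored[:k]
    return results


pairs = top_candidates(["cat"], ["chien", "chat"], TOY_EMBEDDINGS, k=1)
best_score, best_target = pairs["cat"][0]
print(best_target)
```

This sketch compares every source sentence with every target sentence, which would not scale to the trillions of candidate pairs mentioned in the abstract. A realistic system would need some form of approximate or pruned search at this step, followed by a classifier over the surviving pairs.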

Topics

Machine Learning > Core Methods > Embedding Learning
Machine Learning > Application Areas > Domain Adaptation
Natural Language Processing > Applications > Machine Translation
Natural Language Processing > Resources & Methods > Text Representation
Machine Learning > Learning Types > Transfer Learning
Machine Learning > Learning Paradigms > Domain Adaptation

Keywords

domain adaptation; statistical machine translation; word embedding; translation model; pseudo-parallel sentence; monolingual corpus; sentence extraction
