A hybrid pipeline of rules and machine learning to filter web-crawled parallel corpora

Eduard Barbu; Verginica Barbu Mititelu

EMNLP 2018

Abstract

A hybrid pipeline comprising rules and machine learning is used to filter a noisy web English-German parallel corpus for the Parallel Corpus Filtering task. The core of the pipeline is a module based on the logistic regression algorithm that returns the probability that a translation unit is accepted. The training set for the logistic regression is created by automatic annotation. The quality of the automatic annotation is estimated by manually labeling the training set.
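
The two-stage idea in the abstract (rules first, then a logistic regression that scores each translation unit) can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not list the actual rules or features, so `rule_filter`, `features`, the toy training pairs, and their labels are all hypothetical, and the labels here are hand-made rather than produced by the automatic annotation the authors describe.

```python
import math

def rule_filter(source, target):
    """Hypothetical rule stage: reject empty segments and untranslated copies."""
    s, t = source.strip(), target.strip()
    return bool(s) and bool(t) and s != t

def features(source, target):
    """Hypothetical features: bias, token-length ratio, digit agreement."""
    s_tok, t_tok = source.split(), target.split()
    ratio = min(len(s_tok), len(t_tok)) / max(len(s_tok), len(t_tok))
    s_digits = sorted(ch for ch in source if ch.isdigit())
    t_digits = sorted(ch for ch in target if ch.isdigit())
    return [1.0, ratio, 1.0 if s_digits == t_digits else 0.0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, labels, epochs=500, lr=0.5):
    """Fit logistic regression weights with plain stochastic gradient ascent."""
    weights = [0.0] * len(examples[0])
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            weights = [w + lr * (y - p) * xi for w, xi in zip(weights, x)]
    return weights

def accept_probability(weights, source, target):
    """Probability that a translation unit is accepted; 0.0 if a rule rejects it."""
    if not rule_filter(source, target):
        return 0.0
    x = features(source, target)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

# Toy English-German pairs with made-up labels (1 = accept, 0 = reject).
toy_pairs = [
    ("The house is small .", "Das Haus ist klein ."),
    ("I have 2 cats .", "Ich habe 2 Katzen ."),
    ("Click here", "Klicken Sie hier , um weitere Informationen zu unseren "
                   "Produkten und Diensten zu erhalten 2018"),
    ("Page 3 of 10", "Startseite"),
]
toy_labels = [1, 1, 0, 0]

weights = train([features(s, t) for s, t in toy_pairs], toy_labels)
p_good = accept_probability(weights, "I am 30 years old .", "Ich bin 30 Jahre alt .")
p_bad = accept_probability(weights, "Home", "Willkommen auf unserer Seite 2024 heute")
```

In such a setup, units would be ranked or thresholded by the returned probability; where to place the threshold is a design choice that the abstract does not specify.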

Topics

Machine Learning > Core Methods > Classification
Machine Learning > Optimization & Theory > Optimization
Machine Learning > Application Areas > Data Augmentation
Natural Language Processing > Applications > Information Retrieval
Natural Language Processing > Applications > Machine Translation
Machine Learning > Learning Types > Supervised Learning
Machine Learning > Learning Types > Classification

Keywords

binary classification, logistic regression, machine translation, parallel corpus, supervised learning, web corpus, corpus filtering, automatic annotation, sentence filtering
