Paraphrase Detection on Noisy Subtitles in Six Languages

Eetu Sjöblom; Mathias Creutz; Mikko Aulamo

EMNLP 2018

Abstract

We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus, which comprises six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. We also experiment on other datasets but do not reach the same level of performance there, because of domain mismatch between training and test data.
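To make the word-averaging idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a sentence embedding is the plain mean of its word vectors and that two sentences count as paraphrases when the cosine similarity of their embeddings passes a threshold. The toy vectors, the `threshold` value, and the function names are all hypothetical; the WA model in the paper is trained with supervision, which this sketch does not do.

```python
import math

# Hypothetical 3-dimensional word vectors, hand-picked for illustration only.
TOY_VECTORS = {
    "thank": [0.9, 0.1, 0.0],
    "thanks": [0.8, 0.2, 0.0],
    "you": [0.1, 0.9, 0.0],
    "a": [0.1, 0.4, 0.1],
    "lot": [0.6, 0.3, 0.1],
    "close": [0.0, 0.1, 0.9],
    "the": [0.05, 0.1, 0.3],
    "door": [0.0, 0.2, 0.8],
}


def embed(sentence, vectors=TOY_VECTORS):
    """Return the element-wise mean of the vectors of all known words."""
    tokens = [t for t in sentence.lower().split() if t in vectors]
    if not tokens:
        raise ValueError("no known words in sentence")
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_paraphrase(s1, s2, threshold=0.8):
    """Assumed decision rule: similarity at or above a hand-picked threshold."""
    return cosine(embed(s1), embed(s2)) >= threshold


print(is_paraphrase("thank you", "thanks a lot"))
print(is_paraphrase("thank you", "close the door"))
```

A plain average ignores word order, so under this sketch any two sentences made of the same words get identical embeddings.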

Topics

Deep Learning > Architectures > Neural Networks
Natural Language Processing > Applications > Text Classification
Machine Learning > Learning Types > Supervised Learning

Keywords

multilingual NLP, paraphrase detection, gated recurrent unit, sentence embedding, noise-robust model
