Coping with Noisy Training Data Labels in Paraphrase Detection

Teemu Vahtola; Mathias Creutz; Eetu Sjöblom; Sami Itkonen

EMNLP 2021

Abstract

We present new state-of-the-art benchmarks for paraphrase detection on all six languages in the Opusparcus sentential paraphrase corpus: English, Finnish, French, German, Russian, and Swedish. We reach these baselines by fine-tuning BERT. The best results are achieved on smaller and cleaner subsets of the training sets than was observed in previous research. Additionally, we study a translation-based approach that is competitive for the languages with more limited and noisier training data.
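As a rough illustration of training on a "smaller and cleaner subset", the sketch below keeps only the highest-ranked sentence pairs from a noisy training set. It assumes each pair carries a quality score where higher means more likely to be a true paraphrase; the `Pair` type, the `score` field, and the `cleanest_subset` helper are hypothetical names introduced here, not the authors' code, and the abstract does not say how the subsets were actually chosen.

```python
from typing import List, NamedTuple


class Pair(NamedTuple):
    sent1: str
    sent2: str
    score: float  # hypothetical quality score; higher = more likely a paraphrase


def cleanest_subset(pairs: List[Pair], size: int) -> List[Pair]:
    """Return the `size` highest-scoring pairs, best first."""
    if size < 0:
        raise ValueError("size must be non-negative")
    # Sort by descending score, then truncate to the requested size.
    ranked = sorted(pairs, key=lambda p: p.score, reverse=True)
    return ranked[:size]


# Toy data with made-up scores, for illustration only.
pairs = [
    Pair("How are you?", "How are you doing?", 0.95),
    Pair("Get out!", "See you tomorrow.", 0.10),
    Pair("I have no idea.", "I don't know.", 0.80),
]
subset = cleanest_subset(pairs, 2)
```

In a setup like this, the subset size becomes a hyperparameter: a smaller subset trades training-set coverage for cleaner labels, and the best size would have to be chosen on held-out development data.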

Topics

Deep Learning > Architectures > Transformers
Natural Language Processing > Applications > Text Classification
Natural Language Processing > Resources & Methods > Multilingual NLP
Machine Learning > Learning Types > Classification
Machine Learning > Learning Types > Fine-Tuning
Natural Language Processing > Applications > Semantic Analysis

Keywords

text classification, multilingual NLP, paraphrase detection, noisy label, text similarity, noisy label handling, transformer model
