Using Word Embedding for Cross-Language Plagiarism Detection

Jérémy Ferrero; Laurent Besacier; Didier Schwab; Frédéric Agnès

EACL 2017

Abstract

This paper proposes using distributed representations of words (word embeddings) for cross-language textual similarity detection. The main contributions are the following: (a) we introduce new cross-language similarity detection methods based on distributed representations of words; (b) we combine the proposed methods to verify their complementarity, obtaining an overall F1 score of 89.15% for English-French similarity detection at chunk level (88.5% at sentence level) on a very challenging corpus.
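As a rough illustration of the general idea of comparing texts across languages with word embeddings, the sketch below averages word vectors from a shared English-French space and compares the results with cosine similarity. This is not the method proposed in the paper, which the abstract does not detail. The vectors in `EMBEDDINGS` are made-up toy values, and the function names are hypothetical.

```python
import math

# Hypothetical toy vectors standing in for a shared English-French
# embedding space. The values are invented for illustration only.
EMBEDDINGS = {
    "cat": [0.90, 0.10, 0.00],
    "chat": [0.88, 0.12, 0.02],
    "sleeps": [0.10, 0.90, 0.10],
    "dort": [0.12, 0.85, 0.08],
    "economy": [0.00, 0.10, 0.95],
    "économie": [0.02, 0.08, 0.90],
}


def sentence_vector(tokens):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cross_language_similarity(source_tokens, target_tokens):
    """Score two tokenized texts; returns 0.0 if either has no known token."""
    u = sentence_vector(source_tokens)
    v = sentence_vector(target_tokens)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)


similar = cross_language_similarity(["cat", "sleeps"], ["chat", "dort"])
different = cross_language_similarity(["cat", "sleeps"], ["économie"])
```

With these toy vectors, the translated pair scores higher than the unrelated pair. A detector built on this idea would flag pairs whose score exceeds a threshold tuned on labeled data.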

Topics

Machine Learning > Core Methods > Metric Learning

Natural Language Processing > Applications > Information Retrieval

Natural Language Processing > Applications > Text Classification

Keywords

distributed representation, word embedding, text similarity, cross-language plagiarism detection, textual similarity detection, cross-language similarity detection

Related papers

Cross-Lingual Dependency Parsing with Late Decoding for Truly Low-Resource Languages 2017

Learning and Knowledge Transfer with Memory Networks for Machine Comprehension 2017

Is this a Child, a Girl or a Car? Exploring the Contribution of Distributional Similarity to Learning Referential Word Meanings 2017

Building Web-Interfaces for Vector Semantic Models with the WebVectors Toolkit 2017

Assessing Convincingness of Arguments in Online Debates with Limited Number of Features 2017