Towards better structured and less noisy Web data: Oscar with Register annotations

Veronika Laippala; Anna Salmela; Samuel Rönnqvist; Alham Fikri Aji; Li-Hsin Chang; Asma Dhifallah; Larissa Goulart; Henna Kortelainen; Marc Pàmies; Deise Prina Dutra; Valtteri Skantsi; Lintang Sutawika; Sampo Pyysalo

COLING 2022

Abstract

Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content, as well as noise originating from the crawling process. This article presents one way to reduce this noise: automatic register (genre) identification, which determines whether a text is, e.g., a forum discussion, a lyrical text, or a how-to page. We apply the multilingual register identification model by Rönnqvist et al. (2021) to label the widely used Oscar dataset. Additionally, we evaluate the model on eight new languages, showing that its performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning text files originating from Web crawls, removing remains of boilerplate and other elements that do not belong to the main text of the Web page. The register-labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register_oscar.
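As a rough illustration of how register labels can be used to reduce noise, the sketch below keeps only documents that carry at least one wanted register. It is not the authors' pipeline: the record layout (`text`, `registers`), the label strings, and the helper `filter_by_register` are all hypothetical and do not reflect the actual schema of the released dataset.

```python
from typing import Dict, Iterable, Iterator, List, Set

# Hypothetical records: the field names and label strings are assumptions
# made for this sketch, not the schema of the released dataset.
DOCS: List[Dict] = [
    {"text": "Step 1: preheat the oven.", "registers": ["how_to"]},
    {"text": "Same here, anyone else seen this?", "registers": ["forum_discussion"]},
    {"text": "Home | About | Contact", "registers": []},  # no register assigned
    {"text": "O rose, thou art sick.", "registers": ["lyrical"]},
]


def filter_by_register(docs: Iterable[Dict], keep: Set[str]) -> Iterator[Dict]:
    """Yield documents that have at least one register label in `keep`."""
    for doc in docs:
        if keep.intersection(doc["registers"]):
            yield doc


# Keep how-to pages and lyrical texts; drop everything else,
# including documents that received no register label.
kept = list(filter_by_register(DOCS, {"how_to", "lyrical"}))
kept_texts = [doc["text"] for doc in kept]
```

Treating unlabeled documents as noise is a design choice of this sketch; a real selection step might instead inspect them separately.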

Topics

Machine Learning > Core Methods > Classification
Machine Learning > Application Areas > Data Augmentation

Keywords

noise reduction, text preprocessing, multilingual corpus, register identification, web data cleaning
