Identifying and analyzing ‘noisy’ spelling errors in a second language corpus

Alan Juffs; Ben Naismith

NAACL 2025

Abstract

This paper addresses the problem of identifying and analyzing ‘noisy’ spelling errors in texts written by second language (L2) learners in a written corpus. Using Python, spelling errors were identified in 5,774 texts of 66 words or more (total = 1,814,209 words), selected from a corpus of 4.2 million words (Authors-1). The statistical analysis used hurdle() models in R, which are appropriate for non-normal count data with many zeros.
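The two steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the tokenizer, the lexicon lookup used to flag a spelling error, and the function names are all hypothetical, and the abstract does not say which count distribution the hurdle() models used, so a zero-truncated Poisson is assumed here purely for illustration. Only the 66-word threshold comes from the text.

```python
import math
import re

MIN_WORDS = 66  # length threshold stated in the abstract


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def count_errors(text, lexicon):
    """Return (n_words, n_errors), or None if the text has fewer than MIN_WORDS tokens.

    Assumption: a token counts as an error when it is absent from `lexicon`.
    """
    tokens = tokenize(text)
    if len(tokens) < MIN_WORDS:
        return None
    errors = sum(1 for t in tokens if t not in lexicon)
    return len(tokens), errors


def hurdle_poisson_pmf(y, p_zero, lam):
    """P(Y = y) for a hurdle model with an assumed zero-truncated Poisson count part.

    p_zero is the probability of observing no errors at all; positive counts
    follow a Poisson(lam) distribution rescaled to exclude zero.
    """
    if y == 0:
        return p_zero
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    return (1.0 - p_zero) * poisson / (1.0 - math.exp(-lam))
```

The hurdle structure treats "any errors at all" and "how many errors, given at least one" as separate processes, which is why it suits per-text error counts where many texts contain no errors.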
