2025 EMNLP EMNLP 2025

COAS2W: A Chinese Older-Adults Spoken-to-Written Transformation Corpus with Context Awareness

Abstract

AbstractSpoken language from older adults often deviates from written norms due to omission, disordered syntax, constituent errors, and redundancy, limiting the usefulness of automatic transcripts in downstream tasks. We present COAS2W, a Chinese spoken-to-written corpus of 10,004 utterances from older adults, each paired with a written version, fine-grained error labels, and four-sentence context. Fine-tuned lightweight open-source models on COAS2W outperform larger closed-source models. Context ablation shows the value of multi-sentence input, and normalization improves performance on downstream translation tasks. COAS2W supports the development of inclusive, context-aware language technologies for older speakers. Our annotation convention, data, and code are publicly available at https://github.com/Springrx/COAS2W.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Interdisciplinary and Machine Learning and Natural Language Processing and Speech & Audio
🧭 Keyword Pioneer — spoken-to-written transformation
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio