2022 COLING COLING 2022

KOJAK: A New Corpus for Studying German Discourse Particle ja

Abstract

AbstractIn German, ja can be used as a discourse particle to indicate that a proposition, according to the speaker, is believed by both the speaker and audience. We use this observation to create KoJaK, a distantly-labeled English dataset derived from Europarl for studying when a speaker believes a statement to be common ground. This corpus is then analyzed to identify lexical choices in English that correspond with German ja. Finally, we perform experiments on the dataset to predict if an English clause corresponds to a German clause containing ja and achieve an F-measure of 75.3% on a balanced test corpus.

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio