Type and Complexity Signals in Multilingual Question Representations

Robin Kokot; Wessel Poelman

2025 EMNLP EMNLP 2025

Type and Complexity Signals in Multilingual Question Representations

Abstract

AbstractThis work investigates how a multilingual transformer model represents morphosyntactic properties of questions. We introduce the Question Type and Complexity (QTC) dataset with sentences across seven languages, annotated with type information and complexity metrics including dependency length, tree depth, and lexical density. Our evaluation extends probing methods to regression labels with selectivity controls to quantify gains in generalizability. We compare layer-wise probes on frozen Glot500-m (Imani et al., 2023) representations against subword TF-IDF baselines, and a fine-tuned model. Results show that statistical features classify questions well in explicitly marked languages and structural complexity prediction, while neural probes lead on individual metrics. We use these results to evaluate when contextual representations outperform statistical baselines and whether parameter updates reduce availability of pre-trained linguistic information.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

🧭 Keyword Pioneer — morphosyntactic property

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Robin Kokot , Wessel Poelman

Topics

Machine Learning > Core Methods > Representation Learning Deep Learning > Architectures > Transformers

Keywords

multilingual representation structural complexity transformer model neural probing morphosyntactic property regression label

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025