Standardizing Heterogeneous Corpora with DUUR: A Dual Data- and Process-Oriented Approach to Enhancing NLP Pipeline Integration

Leon Lukas Hammerla; Alexander Mehler; Giuseppe Abrami

2025 AACL AACL 2025

Standardizing Heterogeneous Corpora with DUUR: A Dual Data- and Process-Oriented Approach to Enhancing NLP Pipeline Integration

Abstract

AbstractDespite their success, LLMs are too computationally expensive to replace task- or domain-specific NLP systems. However, the variety of corpus formats makes reusing these systems difficult. This underscores the importance of maintaining an interoperable NLP landscape. We address this challenge by pursuing two objectives: standardizing corpus formats and enabling massively parallel corpus processing. We present a unified conversion framework embedded in a massively parallel, microservice-based, programming language-independent NLP architecture designed for modularity and extensibility. It allows for the integration of external NLP conversion tools and supports the addition of new components that meet basic compatibility requirements. To evaluate our dual data- and process-oriented approach to standardization, we (1) benchmark its efficiency in terms of processing speed and memory usage, (2) demonstrate the benefits of standardized corpus formats for NLP downstream tasks, and (3) illustrate the advantages of incorporating custom formats into a corpus format ecosystem.

🌉 Interdisciplinary Bridge — Deep Learning and Natural Language Processing

🧭 Keyword Pioneer — format conversion

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Speech & Audio

Authors

Leon Lukas Hammerla , Alexander Mehler , Giuseppe Abrami

Topics

Deep Learning > Techniques > Model Architecture Natural Language Processing > Resources & Methods > Text Representation

Keywords

parallel processing microservice architecture nlp pipeline corpus standardization format conversion

Download PDF

Related papers

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems 2025

Enhancing Training Data Quality through Influence Scores for Generalizable Classification: A Case Study on Sexism Detection 2025

CtrlShift: Steering Language Models for Dense Quotation Retrieval with Dynamic Prompts 2025

A Diagnostic Framework for Auditing Reference-Free Vision-Language Metrics 2025