Groningen team D at SemEval-2024 Task 8: Exploring data generation and a combined model for fine-tuning LLMs for Multidomain Machine-Generated Text Detection

Thijs Brekhof; Xuanyi Liu; Joris Ruitenbeek; Niels Top; Yuwen Zhou

2024 SEMEVAL SemEval 2024

Groningen team D at SemEval-2024 Task 8: Exploring data generation and a combined model for fine-tuning LLMs for Multidomain Machine-Generated Text Detection

Abstract

AbstractIn this system description, we describe our process and the systems that we created for the subtasks A monolingual, A multilingual, and B forthe SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box MachineGenerated Text Detection. This shared task aimsat detecting and differentiating between machinegenerated text and human-written text. SubtaskA is focused on detecting if a text is machinegenerated or human-written both in a monolingualand a multilingual setting. Subtask B is also focused on detecting if a text is human-written ormachine-generated, though it takes it one step further by also requiring the detection of the correct language model used for generating the text.For the monolingual aspects of this task, our approach is centered around fine-tuning a debertav3-large LM. For the multilingual setting, we created an ensemble model utilizing different monolingual models and a language identification toolto classify each text. We also experiment with thegeneration of extra training data. Our results showthat the generation of extra data aids our modelsand leads to an increase in accuracy.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio