MANTA: A Scalable Pipeline for Transmuting Massive Web Corpora into Instruction Datasets

Heuiyeen Yeen; Seokhee Hong; Hyeongu Yun; Jinsik Lee

EMNLP 2025

Abstract

We introduce MANTA, an automated pipeline that generates high-quality large-scale instruction fine-tuning datasets from massive web corpora while preserving their diversity and scalability. By extracting structured syllabi from web documents and leveraging high-performance LLMs, our approach enables highly effective query-response generation with minimal human intervention. Extensive experiments on 8B-scale LLMs demonstrate that fine-tuning on the MANTA-1M dataset significantly outperforms other massive dataset generation methodologies, particularly in knowledge-intensive tasks such as MMLU and MMLU-Pro, while also delivering superior performance across a broad spectrum of tasks. Moreover, MANTA supports seamless scalability by allowing the continuous integration of web corpus data, enabling expansion into domains requiring intensive knowledge.
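The two stages named in the abstract (syllabus extraction, then query-response generation) can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the function names, the prompt wording, the `Example` record, and the offline `stub_llm` are hypothetical and are not taken from the paper, which does not specify these details in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Example:
    query: str
    response: str


def extract_syllabus(document: str, llm: LLM) -> List[str]:
    """Ask the model for syllabus topics, one per line (assumed prompt format)."""
    raw = llm(f"List the topics taught by this document, one per line:\n{document}")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def generate_examples(syllabus: List[str], llm: LLM) -> List[Example]:
    """Turn each syllabus topic into one query-response pair."""
    examples = []
    for topic in syllabus:
        query = llm(f"Write one question about: {topic}").strip()
        response = llm(f"Answer the question: {query}").strip()
        examples.append(Example(query, response))
    return examples


def run_pipeline(documents: List[str], llm: LLM) -> List[Example]:
    """Chain both stages over a batch of web documents."""
    dataset: List[Example] = []
    for doc in documents:
        dataset.extend(generate_examples(extract_syllabus(doc, llm), llm))
    return dataset


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real model, so the sketch runs offline."""
    if prompt.startswith("List the topics"):
        return "photosynthesis\n\ncellular respiration\n"
    if prompt.startswith("Write one question"):
        return "What is " + prompt.split(": ", 1)[1] + "?"
    return "A short answer to: " + prompt.split(": ", 1)[1]


dataset = run_pipeline(["Plants convert light into chemical energy ..."], stub_llm)
for ex in dataset:
    print(ex.query, "->", ex.response)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client; a real run would replace `stub_llm` with a call to an actual model and would likely add filtering and deduplication steps that are not shown here.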

Topics

Machine Learning > Application Areas > Data Augmentation
Deep Learning > Techniques > Pretraining
Natural Language Processing > Generation > Language Modeling
Natural Language Processing > Resources & Methods > Large Language Models
Computer Science > Applications > Software Engineering
Machine Learning > Learning Types > Fine-Tuning

Keywords

data augmentation; language model training; web corpus; language model; instruction fine-tuning; knowledge-intensive task; instruction dataset; dataset generation; large language model; high-performance LLM
