TokenSmith: Streamlining Data Editing, Search, and Inspection for Large-Scale Language Model Training and Interpretability

Mohammad Aflah Khan; Ameya Godbole; Johnny Wei; Ryan Yixiang Wang; James Flemings; Krishna P. Gummadi; Willie Neiswanger; Robin Jia

EMNLP 2025

Abstract

Understanding the relationship between training data and model behavior during pretraining is crucial, but existing workflows make this process cumbersome, fragmented, and often inaccessible to researchers. We present TokenSmith, an open-source library for interactive editing, inspection, and analysis of datasets used in Megatron-style pretraining frameworks such as GPT-NeoX, Megatron, and NVIDIA NeMo. TokenSmith supports a wide range of operations including searching, viewing, exporting, inspecting, and sampling data, all accessible through a simple user interface and a modular backend. It also enables structured editing of pretraining data without requiring changes to training code, simplifying dataset debugging, validation, and experimentation. TokenSmith is designed as a plug-and-play addition to existing large language model pretraining workflows, thereby democratizing access to production-grade dataset tooling. TokenSmith is hosted on GitHub (https://github.com/aflah02/TokenSmith), with accompanying documentation and tutorials (https://aflah02.github.io/TokenSmith/). A demonstration video is also available on YouTube (https://www.youtube.com/watch?v=cDO8VE9fZvU).

Topics

Artificial Intelligence > Core AI > Interpretability
Machine Learning > Application Areas > Data Augmentation
Machine Learning > Application Areas > Efficient Computing
Natural Language Processing > Resources & Methods > Text Representation
Computer Science > Applications > Software Engineering
Machine Learning > Application Areas > Model Compression
Machine Learning > Core Methods > Feature Selection
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Models > Large Language Models
Deep Learning > Learning Types > Self-Supervised Learning
Artificial Intelligence > Core AI > Efficient Computing

Keywords

text representation, model interpretability, data preprocessing, large language model, pretraining data, data inspection, dataset editing, dataset management, dataset inspection, data editing, Megatron-style pretraining
