Quantifying Memorization in Continual Pre-training with Japanese General or Industry-Specific Corpora

Hiromu Takahashi; Shotaro Ishihara

ACL 2025

Abstract

Despite growing concern about the memorization of training data by large language models (LLMs), there has been insufficient analysis under conditions using non-English or industry-specific corpora. This study focuses on continual pre-training, a common approach to building non-English LLMs, and quantifies memorization of training data. Specifically, we trained two models based on Llama 3 using Japanese Wikipedia (general) and Japanese financial news articles (industry-specific). Experiments showed a tendency for the amount of memorization to increase as training progressed, similar to the empirical findings for English. This trend was clear in the industry-specific corpus, suggesting potential risks when using valuable, non-general industry corpora. We also identified issues specific to Japanese and emphasized the importance of analysis in languages other than English.
