Pretraining

72 directly classified papers

Papers

Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study ACL 2025

Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction ACL 2025

Designing and Contextualising Probes for African Languages ACL 2025

Data-Efficient Selection via Grammatical Complexity in Continual Pre-training of Domain-Specific LLMs EMNLP 2025

VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search EMNLP 2025

DELTA: Pre-Train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment AAAI 2025

Training compute-optimal transformer encoder models EMNLP 2025

LP Data Pipeline: Lightweight, Purpose-driven Data Pipeline for Large Language Models EMNLP 2025

Unveiling the Potential of BERT-family: A New Recipe for Building Scalable, General and Competitive Large Language Models ACL 2025

On the Path to Make Ukrainian a High-Resource Language ACL 2025

MathPile: A Billion-Token-Scale Pretraining Corpus for Math NeurIPS 2024

REFeREE: A REference-FREE Model-Based Metric for Text Simplification COLING 2024

OtoBERT: Identifying Suffixed Verbal Forms in Modern Hebrew Literature EMNLP 2024

A Survey on Model Compression and Acceleration for Pretrained Language Models AAAI 2023

Encoder and Decoder, Not One Less for Pre-trained Language Model Sponsored NMT ACL 2023

Pre-training Multi-party Dialogue Models with Latent Discourse Inference ACL 2023

DLAMA: A Framework for Curating Culturally Diverse Facts for Probing the Knowledge of Pretrained Language Models ACL 2023

DocSplit: Simple Contrastive Pretraining for Large Document Embeddings EMNLP 2023

PairSpanBERT: An Enhanced Language Model for Bridging Resolution ACL 2023

Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling ACL 2023

DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains ACL 2023

DarkBERT: A Language Model for the Dark Side of the Internet ACL 2023

Knowledge-Selective Pretraining for Attribute Value Extraction EMNLP 2023

ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain ACL 2023

CLMSM: A Multi-Task Learning Framework for Pre-training on Procedural Text EMNLP 2023