Distilling Structural Representations into Protein Sequence Models

Jeffrey Ouyang-Zhang; Chengyue Gong; Yue Zhao; Philipp Kraehenbuehl; Adam Klivans; Daniel Jesus Diaz

ICLR 2025

Abstract

Protein language (or sequence) models, such as the popular ESM2, are now widely used for extracting evolution-based protein representations and have achieved significant success on core downstream biological tasks. A major open problem is how to obtain representations that capture both the evolutionary history of a sequence and the atomic structural properties of the protein. We introduce the **I**mplicit **S**equence **M**odel (ISM), a sequence-only input model with structurally enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks, including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into ESM2's pre-trained model. Notably, we make ISM's structure-enriched weights easily accessible for any application using the ESM2 framework.
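To make the idea of distilling structure tokens into a sequence model concrete, the following is a minimal sketch rather than the paper's actual objective. It assumes that each residue has one discrete target structure token and that the sequence model emits one logit vector per residue over the token vocabulary; the loss is then an ordinary mean cross-entropy. The function names, the vocabulary size of 4, and the toy inputs are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_distillation_loss(per_residue_logits, structure_tokens):
    """Mean cross-entropy between predicted token distributions and targets.

    per_residue_logits: one logit vector per residue (hypothetical output of
        a sequence model head over the structure-token vocabulary).
    structure_tokens: one integer token id per residue (hypothetical output
        of a structure tokenizer; assumed, not taken from the paper).
    """
    if len(per_residue_logits) != len(structure_tokens):
        raise ValueError("need exactly one target token per residue")
    losses = []
    for logits, target in zip(per_residue_logits, structure_tokens):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy example: 3 residues, hypothetical vocabulary of 4 structure tokens.
targets = [0, 1, 2]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in targets]
confident_logits = [
    [10.0, 0.0, 0.0, 0.0],
    [0.0, 10.0, 0.0, 0.0],
    [0.0, 0.0, 10.0, 0.0],
]
uniform_loss = token_distillation_loss(uniform_logits, targets)
confident_loss = token_distillation_loss(confident_logits, targets)
```

In this toy setting an uninformed model that spreads probability evenly over the 4 tokens incurs a loss of ln 4, while a model that places most of its mass on the correct token for every residue incurs a loss close to zero. A training loop would minimize this quantity with respect to the sequence model's parameters.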

