QTIP: Quantization with Trellises and Incoherence Processing

Albert Tseng; Qingyao Sun; David Hou; Christopher De Sa

NeurIPS 2024

Abstract

Post-training quantization (PTQ) reduces the memory footprint of LLMs by quantizing weights to low-precision datatypes. Since LLM inference is usually memory-bound, PTQ methods can improve inference throughput. Recent state-of-the-art PTQ approaches use vector quantization (VQ) to quantize multiple weights at once, which improves information utilization through better shaping. However, VQ requires a codebook with size exponential in the dimension. This limits current VQ-based PTQ works to low VQ dimensions ($\le 8$) that in turn limit quantization quality. Here, we introduce QTIP, which instead uses trellis coded quantization (TCQ) to achieve ultra-high-dimensional quantization. TCQ uses a stateful decoder that separates the codebook size from the bitrate and effective dimension. QTIP introduces a spectrum of lookup-only to computed lookup-free trellis codes designed for a hardware-efficient "bitshift" trellis structure; these codes achieve state-of-the-art results in both quantization quality and inference speed.
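
The two scaling claims in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the placeholder codebook, and the reading of "bitshift" as a sliding bit window are assumptions made for illustration only. The first function counts the entries an unstructured VQ codebook needs at a given bitrate and dimension. The second decodes a bitstream with a stateful decoder whose codebook size depends only on the number of state bits, not on how many weights are decoded.

```python
def vq_codebook_entries(bits_per_weight: int, dim: int) -> int:
    """Entries in an unstructured VQ codebook: one per possible code of
    bits_per_weight * dim bits, so the size is exponential in dim."""
    return 2 ** (bits_per_weight * dim)


def sliding_window_decode(bits, state_bits, bits_per_step, codebook):
    """Hypothetical stateful decoder (an assumption, not the paper's code).

    The state is a window of `state_bits` bits over the bitstream. Each step
    advances the window by `bits_per_step` bits and looks up one value, so
    consecutive states overlap in `state_bits - bits_per_step` bits.
    The codebook has 2 ** state_bits entries regardless of sequence length.
    """
    assert len(codebook) == 2 ** state_bits
    decoded = []
    for start in range(0, len(bits) - state_bits + 1, bits_per_step):
        state = 0
        for bit in bits[start:start + state_bits]:
            state = (state << 1) | bit  # pack the window into an integer
        decoded.append(codebook[state])
    return decoded


# VQ at 2 bits per weight: the codebook grows from 2^16 to 2^32 entries
# when the dimension goes from 8 to 16.
small = vq_codebook_entries(2, 8)
large = vq_codebook_entries(2, 16)

# Placeholder codebook that returns the state index itself, so the
# overlapping states are visible in the output.
toy_codebook = list(range(16))
states = sliding_window_decode([1, 0, 1, 1, 0, 0], 4, 2, toy_codebook)
```

With the toy input, the two windows are `1011` and `1100`, so the decoder visits states 11 and 12; the low two bits of the first state are the high two bits of the second. A real code would replace the placeholder codebook with learned or computed values.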

Topics

Machine Learning > Core Methods > Embedding Learning; Machine Learning > Optimization & Theory > Optimization; Machine Learning > Application Areas > Efficient Computing; Machine Learning > Application Areas > Model Compression; Deep Learning > Models > Large Language Models; Artificial Intelligence > Core AI > Efficient Computing; Deep Learning > Optimization & Theory > Optimization; Deep Learning > Optimization & Theory > Model Compression; Deep Learning > Optimization & Theory > Efficient Computing

Keywords

model compression; model quantization; post-training quantization; vector quantization; memory footprint; inference speed; trellis coded quantization; large language model
