Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Junhan Kim; Chungman Lee; Eulrang Cho; Kyungphil Park; Joonyoung Kim; Yongkweon Jeon; Hoyoung Kim; Ho-young Kim

NeurIPS 2024

Abstract

With the increasing complexity of generative AI models, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile devices and TVs. Existing PTQ schemes, however, consume considerable time and resources, which can be a bottleneck in real situations where frequent model updates and repeated hyperparameter tuning are required. As a cost-effective alternative, learning-free PTQ schemes have been proposed. However, their performance is somewhat limited because they cannot account for the inter-layer dependency within the attention module, which is a significant feature of Transformers. In this paper, we thus propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm, called aespa, is to perform quantization layer-wise for efficiency while targeting attention-wise reconstruction to account for the cross-layer dependency. Through extensive experiments on various language models and complexity analysis, we demonstrate that aespa is accurate and efficient in quantizing Transformer models. The code will be available at https://github.com/SamsungLabs/aespa.
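The contrast the abstract draws between a layer-wise target and an attention-wise target can be illustrated with a toy example. The sketch below is not the aespa algorithm, whose objective and update rules are not given in the abstract. It only quantizes a query projection with plain round-to-nearest and measures the resulting error in two ways: at the output of that single layer, and at the output of a single-head attention block. All function names, shapes, and the 4-bit setting are assumptions chosen for illustration.

```python
import numpy as np


def quantize_rtn(w, n_bits=4):
    """Symmetric per-row round-to-nearest quantization (illustrative baseline)."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / q_max
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(w / scale), -q_max - 1, q_max) * scale


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(x, w_q, w_k, w_v):
    """Single-head self-attention without masking or output projection."""
    q, k, v = x @ w_q.T, x @ w_k.T, x @ w_v.T
    scores = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return scores @ v


def layer_wise_error(x, w, w_hat):
    """Squared error at the output of one linear layer."""
    return float(np.linalg.norm(x @ w.T - x @ w_hat.T) ** 2)


def attention_wise_error(x, w_q, w_k, w_v, w_q_hat):
    """Squared error at the attention output when only W_q is quantized."""
    ref = attention(x, w_q, w_k, w_v)
    out = attention(x, w_q_hat, w_k, w_v)
    return float(np.linalg.norm(ref - out) ** 2)


# Hypothetical toy data: 16 tokens, hidden size 32.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
w_q, w_k, w_v = (rng.standard_normal((32, 32)) / np.sqrt(32) for _ in range(3))

w_q_hat = quantize_rtn(w_q, n_bits=4)
lw_err = layer_wise_error(x, w_q, w_q_hat)
aw_err = attention_wise_error(x, w_q, w_k, w_v, w_q_hat)
print(f"layer-wise error: {lw_err:.4f}, attention-wise error: {aw_err:.4f}")
```

The two numbers measure different things: the layer-wise error ignores how the query perturbation passes through the softmax and interacts with the keys and values, whereas the attention-wise error captures that interaction. This is the kind of cross-layer dependency the abstract refers to.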

Authors

Junhan Kim, Chungman Lee, Eulrang Cho, Kyungphil Park, Ho-young Kim, Hoyoung Kim, Joonyoung Kim, Yongkweon Jeon

Topics

Artificial Intelligence > Core AI > Model Compression
Deep Learning > Architectures > Transformers
Machine Learning > Application Areas > Model Compression
Deep Learning > Models > Large Language Models
Deep Learning > Optimization & Theory > Optimization
Deep Learning > Optimization & Theory > Model Compression
Deep Learning > Optimization & Theory > Efficient Computing

Keywords

model compression, post-training quantization, edge deployment, attention module, large language model, transformer model, layer-wise quantization, attention reconstruction
