AFPQ: Asymmetric Floating Point Quantization for LLMs

Yijia Zhang; Sicheng Zhang; Shijie Cao; DaYou Du; Jianyu Wei; Ting Cao; Ningyi Xu

2024 ACL ACL 2024

AFPQ: Asymmetric Floating Point Quantization for LLMs

Abstract

AbstractLarge language models (LLMs) show great performance in various tasks, but face deployment challenges from limited memory capacity and bandwidth.Low-bit weight quantization can save memory and accelerate inference.Although floating-point (FP) formats show good performance in LLM quantization, they tend to perform poorly with small group sizes or sub-4 bits.We find the reason is that the absence of asymmetry in previous FP quantization makes it unsuitable for handling asymmetric value distribution of LLM weight tensors.In this work, we propose asymmetric FP quantization (AFPQ), which sets separate scales for positive and negative values.Our method leads to large accuracy improvements and can be easily plugged into other quantization methods, including GPTQ and AWQ, for better performance.Besides, no additional storage is needed compared with asymmetric integer (INT) quantization.The code is available at https://github.com/zhangsichengsjtu/AFPQ.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🧭 Keyword Pioneer — floating point quantization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yijia Zhang , Sicheng Zhang , Shijie Cao , DaYou Du , Jianyu Wei , Ting Cao , Ningyi Xu

Topics

Artificial Intelligence > Core AI > Model Compression Machine Learning > Optimization & Theory > Optimization Machine Learning > Application Areas > Efficient Computing Machine Learning > Application Areas > Model Compression Deep Learning > Models > Large Language Models Deep Learning > Optimization & Theory > Model Compression

Keywords

model compression model quantization efficient inference weight quantization inference acceleration large language model floating point floating point quantization

Download PDF

Related papers

Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs 2024

EtymoLink: A Structured English Etymology Dataset 2024

Turkish Delights: A Dataset on Turkish Euphemisms 2024

Subjectivity Detection in English News using Large Language Models 2024

Does DetectGPT Fully Utilize Perturbation? Bridging Selective Perturbation to Fine-tuned Contrastive Learning Detector would be Better 2024