MPPQ: Enhancing Post-Training Quantization for LLMs via Mixed Supervision, Proxy Rounding, and Pre-Searching

Mingrun Wei, Yeyu Yan, Dong Wang

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 8277-8285. https://doi.org/10.24963/ijcai.2025/920

Recently, post-training quantization (PTQ) methods for large language models (LLMs) have primarily focused on tackling the challenges caused by outliers. Scaling transformations have proven effective, yet how to improve extremely low-bitwidth (e.g., 2-bit) PTQ under such transformations remains largely unexplored. In this work, we establish a new PTQ framework, MPPQ. Specifically, MPPQ first proposes an enhanced reconstruction loss based on Mixed metric supervision, which mitigates the distribution inconsistency caused by quantization while providing strong regularization for the learnable parameters. Second, we introduce a Proxy-based adaptive rounding scheme for weight quantization, which replaces the round-to-nearest (RTN) function and minimizes the overall quantization error through element-wise scaling. Furthermore, a coarse Pre-searching mechanism for clipping factors is presented to ensure proper coordination between quantization and clipping patterns, while providing well-initialized clipping factors before training. Extensive experiments show that MPPQ consistently outperforms state-of-the-art methods in low-bit quantization settings. For instance, on the LLaMA-2-7B model quantized with W4A4, MPPQ reduces WikiText2 perplexity to 8.85 (3.9 ↓ vs. 12.75 for the latest method, LRQuant).
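To make the contrast with round-to-nearest concrete, the following is a minimal, hypothetical sketch of adaptive rounding driven by a learnable element-wise scaling, in the spirit of the Proxy-based scheme described above. It is not the authors' implementation: the exact form of the proxy, the optimization objective, and names such as proxy_round_quantize and alpha are assumptions for illustration only.

import torch

def rtn_quantize(w, n_bits=4):
    """Baseline: per-tensor asymmetric round-to-nearest (RTN) quantization."""
    qmax = 2 ** n_bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / qmax
    zero = torch.round(-w.min() / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale

def proxy_round_quantize(w, x, n_bits=4, steps=200, lr=1e-2):
    """Assumed sketch of proxy-based adaptive rounding: a learnable element-wise
    factor `alpha` rescales the pre-rounding values, and a straight-through
    estimator lets gradients pass through round() while minimizing the layer's
    output reconstruction error on calibration activations."""
    qmax = 2 ** n_bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-8) / qmax
    zero = torch.round(-w.min() / scale)
    alpha = torch.ones_like(w, requires_grad=True)  # alpha=1 recovers RTN
    opt = torch.optim.Adam([alpha], lr=lr)
    y_ref = x @ w.t()  # full-precision layer output as the reconstruction target
    for _ in range(steps):
        v = w / scale * alpha
        # Straight-through estimator: forward uses round(v), backward sees v.
        q = torch.clamp(v + (torch.round(v) - v).detach() + zero, 0, qmax)
        w_hat = (q - zero) * scale
        loss = torch.nn.functional.mse_loss(x @ w_hat.t(), y_ref)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        v = w / scale * alpha
        q = torch.clamp(torch.round(v) + zero, 0, qmax)
        return (q - zero) * scale

if __name__ == "__main__":
    torch.manual_seed(0)
    w = torch.randn(64, 128)   # toy weight matrix
    x = torch.randn(256, 128)  # toy calibration activations
    y = x @ w.t()
    for name, w_hat in [("RTN", rtn_quantize(w)), ("proxy", proxy_round_quantize(w, x))]:
        err = torch.nn.functional.mse_loss(x @ w_hat.t(), y).item()
        print(f"{name:>5s} output MSE: {err:.6f}")

Because alpha is initialized to ones, the sketch starts from the RTN solution and can only move toward lower layer-output error on the calibration data; the real method's proxy formulation and loss (including the mixed metric supervision and pre-searched clipping factors) are described in the paper itself.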
Keywords:
Natural Language Processing: NLP: Language models
Machine Learning: ML: Applications
Machine Learning: ML: Optimization
Natural Language Processing: NLP: Applications