If ever there was a salient example of a counterintuitive approach, it might be the quantization of neural networks. Quantization reduces the precision of the weights and other tensors in neural network models, often drastically. It's no surprise that reducing the precision of weights and other parameters from, say, 32-bit floats to 8-bit integers makes the model run faster and lets it run on less powerful processors with far less memory. The striking, counterintuitive finding is that quantization can be done while largely preserving the model's accuracy.
Why do we need quantization? Today's large language models (LLMs) are huge. The best models need to run on a cluster of server-class GPUs; gone are the days when you could run a state-of-the-art model locally on one GPU and get quick results. Quantization not only makes it possible to run an LLM on a single GPU, it lets you run it on a CPU or on an edge device.
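To make the idea concrete, here is a minimal NumPy sketch of affine 8-bit quantization of a single tensor. It is an illustration of the general technique, not any particular framework's implementation; the function names and the random example tensor are placeholders of my own.

```python
import numpy as np

# Minimal sketch: map a float32 tensor onto 256 int8 levels with a scale and
# zero point. Real frameworks add per-channel scales, calibration, and fused
# integer kernels on top of this basic idea.
def quantize_int8(x: np.ndarray):
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    zero_point = int(round(-x_min / scale)) - 128   # int8 value that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
# The round trip loses at most about half a quantization step per value.
print("max absolute error:", np.abs(dequantize(q, scale, zp) - weights).max())
```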
Post-training quantization
Post-training quantization is a conversion technique that can reduce model size while also improving CPU and hardware accelerator latency, with little degradation in model accuracy.
Given how mature TensorFlow Lite is compared to, say, the Gen AI model du jour (probably Mistral AI's Codestral, which was released the day I wrote this), it's worth looking at how TensorFlow Lite implements quantization. First of all, TensorFlow Lite offers three options for quantization:
| Technique | Benefits | Hardware |
| --- | --- | --- |
| Dynamic range quantization | 4x smaller, 2x-3x speedup | CPU |
| Full integer quantization | 4x smaller, 3x+ speedup | CPU, Edge TPU, microcontrollers |
| Float16 quantization | 2x smaller, GPU acceleration | CPU, GPU |
In the decision tree that accompanies this table, the TensorFlow Lite documenters outline the considerations for choosing a quantization technique. It's worth reading through the logic. In a nutshell, the best post-training quantization method for your use case will depend on your hardware's support for integer or floating-point operations and on whether you can provide a representative data set for calibration.
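To illustrate what the calibration requirement looks like in practice, here is a hedged sketch of full integer quantization with the TensorFlow Lite converter. The `saved_model_dir` path and the random `calibration_images` array are placeholders for your own exported model and a few hundred representative inputs.

```python
import numpy as np
import tensorflow as tf

# Placeholders: point these at your own exported SavedModel and real samples.
saved_model_dir = "path/to/saved_model"
calibration_images = np.random.rand(100, 224, 224, 3).astype(np.float32)

def representative_data_gen():
    # Yield a modest number of real inputs so the converter can calibrate
    # activation ranges; here they are random stand-ins.
    for sample in calibration_images:
        yield [sample[np.newaxis, ...]]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
# Force integer-only ops (useful for Edge TPU or microcontrollers); drop the
# next three lines to allow a float fallback for unsupported ops.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_int8_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_int8_model)
```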
Dynamic vary quantization
Then they explain why dynamic range quantization is the usual starting point: It provides reduced memory usage and faster computation without requiring you to supply a representative data set for calibration. Dynamic range quantization statically quantizes only the weights from floating…
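A minimal sketch of that starting point, again with a placeholder SavedModel path: setting only `Optimize.DEFAULT` and supplying no representative dataset is what gives you dynamic range quantization in the TensorFlow Lite converter.

```python
import tensorflow as tf

# Dynamic range quantization: no representative dataset is supplied, so the
# converter quantizes the weights to 8 bits at conversion time and leaves the
# graph's inputs and outputs in float.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_dynamic_model = converter.convert()

with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_dynamic_model)
```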