pytorch / ao
- Thursday, October 3, 2024 at 00:00:02
PyTorch native quantization and sparsity for training and inference
torchao: PyTorch library for custom data types & optimizations. Quantize and sparsify weights, gradients, optimizers & activations for inference and training.
From the team that brought you the fast series
torchao just works with torch.compile() and FSDP2 over most PyTorch models on Huggingface out of the box.
Quantizing and sparsifying your models is a one-liner that should work on any model with an nn.Linear, including your favorite HuggingFace model. You can find more comprehensive usage instructions here, sparsity instructions here, and a HuggingFace inference example here.
For inference, we have the option of quantizing only the weights or both the activations and the weights:
from torchao.quantization.quant_api import (
    quantize_,
    int8_dynamic_activation_int8_weight,
    int4_weight_only,
    int8_weight_only,
)
quantize_(m, int4_weight_only())
For gpt-fast, int4_weight_only() is the best option at bs=1, as it doubles the tok/s and reduces the VRAM requirements by about 65% over a torch.compiled baseline.
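To make "weight-only quantization" concrete, here is a minimal, framework-free sketch of symmetric per-row int8 quantization. It is an illustration only, not torchao's implementation: the helper names are hypothetical, and the real kernels use packed tensors, group-wise scales (for int4), and fused dequantization.

```python
def quantize_row_int8(row):
    """Symmetric per-row int8 quantization: returns (integer codes, scale)."""
    amax = max(abs(v) for v in row)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


# Toy weight matrix (hypothetical values, two output rows).
weights = [[0.5, -1.0, 0.25, 0.0], [2.0, 0.1, -0.3, 1.5]]
packed = [quantize_row_int8(r) for r in weights]
restored = [dequantize_row(codes, scale) for codes, scale in packed]

# Worst-case round-trip error across the whole matrix.
max_err = max(
    abs(a - b) for r, q in zip(weights, restored) for a, b in zip(r, q)
)
print(f"max round-trip error: {max_err:.5f}")
```

Each weight is stored as one small integer plus a shared per-row scale, which is where the memory savings come from; the round-trip error is bounded by half a quantization step.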
If you don't have enough VRAM to quantize your entire model on GPU and you find CPU quantization too slow, you can use the device argument, like so: quantize_(model, int8_weight_only(), device="cuda"). This sends each layer to your GPU individually and quantizes it there.
If you see slowdowns with any of these techniques or you're unsure which option to use, consider using autoquant which will automatically profile layers and pick the best way to quantize each layer.
model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
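The "profile and pick the best" idea can be sketched without any framework. This is a conceptual illustration under stated assumptions, not how autoquant is implemented: `pick_fastest`, `wall_clock`, and the candidate workloads are hypothetical stand-ins for per-layer quantization options.

```python
import time


def pick_fastest(candidates, benchmark):
    """Benchmark every candidate and return (best_name, all_timings)."""
    timings = {name: benchmark(fn) for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings


def wall_clock(fn, repeats=5):
    """Best-of-N wall-clock time for calling fn(), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical stand-ins for "the same layer under different quantization".
candidates = {
    "baseline": lambda: sum(i * i for i in range(20000)),
    "int8_weight_only": lambda: sum(i * i for i in range(10000)),
}
choice, timings = pick_fastest(candidates, wall_clock)
print(f"selected: {choice}")
```

Passing the benchmark in as a function keeps the selection logic separate from how time is measured, so the same loop could be driven by a GPU timer instead.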
We also provide a developer-facing API so you can implement your own quantization algorithms; please use the excellent HQQ algorithm as a motivating example.
We've added kv cache quantization and other features in order to enable long context length (and necessarily memory efficient) inference.
In practice, these features alongside int4 weight-only quantization allow us to reduce peak memory by ~55%, meaning we can run Llama3.1-8B inference with a 130k context length with only 18.9 GB of peak memory. More details can be found here.
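A back-of-the-envelope calculation shows why the KV cache dominates memory at long context lengths. The sketch below assumes the publicly documented Llama3.1-8B shape (32 layers, 8 KV heads, head dimension 128) and reads "130k" as 130,000 tokens; it ignores quantization scale storage and does not attempt to reproduce the 18.9 GB peak figure above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem, batch_size=1):
    """Size of the KV cache; the leading 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * bytes_per_elem * batch_size)


# Assumed model shape (not taken from this README).
llama31_8b = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bf16 = kv_cache_bytes(**llama31_8b, context_len=130_000, bytes_per_elem=2)
int8 = kv_cache_bytes(**llama31_8b, context_len=130_000, bytes_per_elem=1)

print(f"bf16 KV cache: {bf16 / 1e9:.2f} GB")
print(f"int8 KV cache: {int8 / 1e9:.2f} GB")
```

Under these assumptions, the cache alone takes roughly 17 GB in bf16, so halving its element size saves more memory than the int4 weights of an 8B model occupy in total.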
Post-training quantization (PTQ) can result in a fast and compact model, but may also lead to accuracy degradation. We recommend exploring Quantization Aware Training (QAT) to overcome this limitation. In collaboration with Torchtune, we've developed a QAT recipe that demonstrates significant accuracy improvements over traditional PTQ, recovering 96% of the accuracy degradation on hellaswag and 68% of the perplexity degradation on wikitext for Llama3. We've provided a full recipe here.
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer
qat_quantizer = Int8DynActInt4WeightQATQuantizer()
# Insert "fake quantize" operations into linear layers.
# These operations simulate quantization numerics
model = qat_quantizer.prepare(model)
# Run Training...
# Convert fake quantize to actual quantize operations
model = qat_quantizer.convert(model)
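For intuition, a "fake quantize" operation can be sketched as a quantize-then-dequantize round trip that stays in floating point. The function below is a scalar illustration with a hypothetical fixed scale, not the code the QAT quantizer inserts; in real training the rounding step is typically paired with a straight-through gradient estimator, which is omitted here.

```python
def fake_quantize(x, scale, qmin=-8, qmax=7):
    """Snap x to the signed int4 grid, then map it straight back to float."""
    q = max(qmin, min(qmax, round(x / scale)))  # quantize and clamp
    return q * scale                            # dequantize


scale = 0.25  # hypothetical per-group scale
inputs = [0.1, 0.3, -0.6, 5.0]
outputs = [fake_quantize(v, scale) for v in inputs]
print(outputs)
```

The model sees the rounding and clamping error during training (note how 5.0 saturates at the top of the int4 range), which is what lets it adapt before the real conversion.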
torchao.float8 implements training recipes with the scaled float8 dtypes, as laid out in https://arxiv.org/abs/2209.05433.
With torch.compile on, current results show throughput speedups of up to 1.5x on 128 H100 GPU LLaMa 3 70B pretraining jobs (details).
from torchao.float8 import convert_to_float8_training
convert_to_float8_training(m, module_filter_fn=...)
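The "scaled" part of scaled float8 can be illustrated with plain arithmetic. The sketch below shows one common scheme, per-tensor scaling from the absolute maximum, as an assumption about the general technique rather than a description of torchao.float8 internals. The constant 448 is the largest finite value of the e4m3fn format; the final rounding to 8 bits is omitted.

```python
E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def float8_scale(values, fp8_max=E4M3_MAX, eps=1e-12):
    """Scale that maps the tensor's largest magnitude onto the fp8 maximum."""
    amax = max(abs(v) for v in values)
    return fp8_max / max(amax, eps)


def scale_and_saturate(values, scale, fp8_max=E4M3_MAX):
    """Apply the scale and clamp into the representable fp8 range."""
    return [max(-fp8_max, min(fp8_max, v * scale)) for v in values]


tensor = [0.002, -0.014, 0.007]  # hypothetical small-magnitude gradients
scale = float8_scale(tensor)
scaled = scale_and_saturate(tensor, scale)
print(f"scale = {scale:.1f}, scaled = {scaled}")
```

Without the scale, values this small would collapse toward zero in an 8-bit float; with it, they span the full dynamic range, and the scale is kept alongside the tensor so results can be rescaled afterwards.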
And for an end-to-end minimal training recipe of pretraining with float8, you can check out torchtitan.
We've added support for semi-structured 2:4 sparsity with 6% end-to-end speedups on ViT-L. Full blog here
The code change is a one-liner, with the full example available here
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear
swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
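The 2:4 pattern itself is simple: in every contiguous group of four weights, at most two are nonzero. The sketch below enforces that pattern with magnitude-based selection, which is an assumption chosen for illustration (a common pruning heuristic); it is not the training-time algorithm used by SemiSparseLinear.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude entries in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude values in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -1.2, 0.3, 0.2, -0.4, 0.0]
sparse = prune_2_4(row)
print(sparse)
```

Because exactly half of each group is zero in a fixed-size window, hardware can store only the surviving values plus a small index, which is what makes this structured pattern faster than unstructured sparsity.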
Adam takes 2x as much memory as the model params for its optimizer state, so we can quantize the optimizer state to either 8 or 4 bits, effectively reducing the optimizer VRAM requirements by 2x or 4x respectively over an fp16 baseline.
from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit, AdamWFp8
optim = AdamW8bit(model.parameters())  # replace with AdamW4bit or AdamWFp8 for the 4-bit / fp8 versions
In practice, we are a tiny bit slower than expertly written kernels, but the implementations for these optimizers were written in a few hundred lines of PyTorch code and compiled, so please use them or copy-paste them for your quantized optimizers. Benchmarks here.
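The 2x and 4x savings follow directly from the bit widths. The arithmetic below is a rough sketch for a hypothetical 8B-parameter model; it counts only the two Adam state tensors and ignores the small overhead of quantization scales.

```python
def adam_state_bytes(num_params, bits_per_state):
    """Adam keeps two state tensors (first and second moments) per parameter."""
    return 2 * num_params * bits_per_state // 8


num_params = 8_000_000_000  # hypothetical 8B-parameter model
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size_gb = adam_state_bytes(num_params, bits) / 1e9
    print(f"{label:>5}: {size_gb:.1f} GB of optimizer state")
```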
We also support single-GPU CPU offloading, where both the gradients (same size as the weights) and the optimizer state are efficiently sent to the CPU. This alone can reduce your VRAM requirements by 60%.
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer
optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)
optim.load_state_dict(ckpt["optim"])
torch.compile: A key design principle for us is composability: any new dtype or layout we provide needs to work with our compiler. It shouldn't matter if the kernels are written in pure PyTorch, CUDA, C++, or Triton - things should just work! So we write the dtype, layout, or bit packing logic in pure PyTorch and code-generate efficient kernels. The best example we have combining the composability of lower bit dtype with compile and fsdp is NF4, which we used to implement the QLoRA algorithm. So if you're doing research at the intersection of these areas, we'd love to hear from you.
We've added support for authoring and releasing custom ops that do not graph break with torch.compile(), so if you love writing kernels but hate packaging them so they work on all operating systems and CUDA versions, we'd love to accept contributions for your custom ops. We have a few examples you can follow:
quantize_(model, fpx_weight_only(3, 2))
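To see what a 6-bit float with 3 exponent bits and 2 mantissa bits can represent, the sketch below decodes every bit pattern of such a format. It assumes the two arguments above are read as (exponent bits, mantissa bits) and that the format reserves no encodings for inf or NaN; `decode_fpx` is a hypothetical helper for illustration, not part of torchao.

```python
def decode_fpx(bits, ebits, mbits):
    """Decode a (1 + ebits + mbits)-bit float with no inf/NaN encodings."""
    sign = -1.0 if (bits >> (ebits + mbits)) & 1 else 1.0
    exp = (bits >> mbits) & ((1 << ebits) - 1)
    man = bits & ((1 << mbits) - 1)
    bias = (1 << (ebits - 1)) - 1
    if exp == 0:  # subnormal: no implicit leading one
        return sign * 2.0 ** (1 - bias) * (man / (1 << mbits))
    return sign * 2.0 ** (exp - bias) * (1 + man / (1 << mbits))


# Every distinct value representable with 3 exponent and 2 mantissa bits.
values = sorted({decode_fpx(b, 3, 2) for b in range(1 << 6)})
print(f"{len(values)} distinct values, range [{values[0]}, {values[-1]}]")
```

With only 64 bit patterns, the representable grid is coarse but spans a wide dynamic range, which is the trade-off sub-byte float formats make for weight storage.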
If you believe there are other CUDA kernels we should be taking a closer look at, please leave a comment on this issue.
Things we're excited about but need more time to cook in the oven
quantize_(model, int8_weight_only_quantized_training())
This work is a prototype, as the memory benchmarks are not compelling yet.
torchao makes liberal use of several new features in PyTorch, so it's recommended to use it with the current nightly or latest stable version of PyTorch.
Stable release from PyPI, which will default to CUDA 12.1
pip install torchao
Stable Release from the PyTorch index
pip install torchao --extra-index-url https://download.pytorch.org/whl/cu121 # full options are cpu/cu118/cu121/cu124
Nightly Release
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu121 # full options are cpu/cu118/cu121/cu124
Most developers will probably want to skip building the custom C++/CUDA extensions for faster iteration
USE_CPP=0 pip install -e .
We're also fortunate to be integrated into some of the leading open-source libraries including
torchao is released under the BSD 3 license.