pytorch / torchtune
- Saturday, April 20, 2024 at 00:00:01
A Native-PyTorch Library for LLM Fine-tuning
torchtune now officially supports Meta Llama3! Check out our recipes for Llama3-8B with LoRA, QLoRA and Full fine-tune in the Llama3 section! 🚀 🦙
torchtune is a PyTorch-native library for easily authoring, fine-tuning and experimenting with LLMs. We're excited to announce our alpha release!
torchtune provides:
torchtune focuses on integrating with popular tools and libraries from the ecosystem. These are just a few examples, with more under development:
torchtune currently supports the following models.
Model | Sizes |
---|---|
Llama3 | 8B [models, configs] |
Llama2 | 7B, 13B, 70B [models, configs] |
Mistral | 7B [model, configs] |
Gemma | 2B [model, configs] |
We'll be adding a number of new models in the coming weeks, including support for 70B versions and MoEs.
torchtune provides the following fine-tuning recipes.
Training | Fine-tuning Method |
---|---|
Distributed Training [1 to 8 GPUs] | Full [code, example], LoRA [code, example] |
Single Device / Low Memory [1 GPU] | Full [code, example], LoRA + QLoRA [code, example] |
Single Device [1 GPU] | DPO [code, example] |
Memory efficiency is important to us. All of our recipes are tested on a variety of setups including commodity GPUs with 24GB of VRAM as well as beefier options found in data centers.
Single-GPU recipes expose a number of memory optimizations that aren't available in the distributed versions. These include support for low-precision optimizers from bitsandbytes and fusing optimizer step with backward to reduce memory footprint from the gradients (see example config). For memory-constrained setups, we recommend using the single-device configs as a starting point. For example, our default QLoRA config has a peak memory usage of ~9.3GB. Similarly LoRA on single device with batch_size=2 has a peak memory usage of ~17.1GB. Both of these are with dtype=bf16 and AdamW as the optimizer.
This table captures the minimum memory requirements for our different recipes using the associated configs.
Example HW Resources | Finetuning Method | Config | Model | Peak Memory per GPU |
---|---|---|---|---|
1 x RTX 4090 | QLoRA | qlora_finetune_single_device | Llama2-7B | 8.57 GB |
2 x RTX 4090 | LoRA | lora_finetune_distributed | Llama2-7B | 20.95 GB |
1 x RTX 4090 | LoRA | lora_finetune_single_device | Llama2-7B | 17.18 GB |
1 x RTX 4090 | Full finetune | full_finetune_single_device | Llama2-7B | 14.97 GB |
4 x RTX 4090 | Full finetune | full_finetune_distributed | Llama2-7B | 22.9 GB |
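To put the table in context, here is a back-of-the-envelope estimate of the memory taken by model weights alone. It is a sketch under stated assumptions: a nominal 7e9 parameters for a "7B" model, 1 GB = 1e9 bytes, and no accounting for activations, gradients, optimizer state or quantization metadata, so it will not reproduce the measured peaks above. The `weight_memory_gb` helper is hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the memory used by the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: a "7B" model has a nominal 7e9 parameters.
for dtype in ("fp32", "bf16", "4bit"):
    print(f"{dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

The gap between the bf16 and 4-bit figures gives a feel for why a QLoRA run, which keeps the frozen base weights quantized, fits on a much smaller GPU than a LoRA run with bf16 weights.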
torchtune supports fine-tuning for the Llama3 8B models, with support for 70B on its way. We currently support LoRA, QLoRA and full fine-tune on a single GPU as well as LoRA and full fine-tune on multiple devices. For all the details, take a look at our tutorial.
In our initial experiments, QLoRA has a peak allocated memory of ~9GB while LoRA on a single GPU has a peak allocated memory of ~19GB. To get started, you can use our default configs to kick off training.
tune run lora_finetune_single_device --config llama3/8B_lora_single_device
tune run lora_finetune_single_device --config llama3/8B_qlora_single_device
tune run --nproc_per_node 4 lora_finetune_distributed --config llama3/8B_lora
tune run --nproc_per_node 2 full_finetune_distributed --config llama3/8B_full
Step 1: Install PyTorch. torchtune is tested with the latest stable PyTorch release (2.2.2) as well as the preview nightly version.
Step 2: The latest stable version of torchtune is hosted on PyPI and can be downloaded with the following command:
pip install torchtune
To confirm that the package is installed correctly, you can run the following command:
tune --help
You should see the following output:
usage: tune [-h] {ls,cp,download,run,validate} ...
Welcome to the TorchTune CLI!
options:
-h, --help show this help message and exit
...
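The same check can be scripted, for example in a setup or CI step. This is a minimal sketch using only the Python standard library; the `check_torchtune_install` helper is a hypothetical name, and it only tests that the package can be found and that a `tune` executable is on the PATH, not that the CLI runs correctly.

```python
import importlib.util
import shutil


def check_torchtune_install() -> dict:
    """Report whether the torchtune package and the tune CLI can be located."""
    return {
        # find_spec returns None when the top-level package is not installed.
        "package_importable": importlib.util.find_spec("torchtune") is not None,
        # shutil.which returns None when no matching executable is on the PATH.
        "cli_on_path": shutil.which("tune") is not None,
    }


status = check_torchtune_install()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```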
To get started with fine-tuning your first LLM with torchtune, see our tutorial on fine-tuning Llama2 7B. Our end-to-end workflow tutorial will show you how to evaluate, quantize and run inference with this model. The rest of this section will provide a quick overview of these steps with Llama2.
Follow the instructions on the official meta-llama repository to ensure you have access to the Llama2 model weights. Once you have confirmed access, you can run the following command to download the weights to your local machine. This will also download the tokenizer model and a responsible use guide.
tune download meta-llama/Llama-2-7b-hf \
--output-dir /tmp/Llama-2-7b-hf \
--hf-token <HF_TOKEN>
Tip: Set your environment variable HF_TOKEN or pass in --hf-token to the command in order to validate your access. You can find your token at https://huggingface.co/settings/tokens
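If you drive the download from a script, the two ways of supplying the token can be handled as below. This is a sketch, not part of torchtune: `build_download_command` is a hypothetical helper that only assembles the argument list shown above and fails early when no token is available from either source.

```python
import os
from typing import List, Mapping, Optional


def build_download_command(
    repo_id: str,
    output_dir: str,
    hf_token: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Assemble a `tune download` command, checking that a token is available."""
    env = os.environ if env is None else env
    cmd = ["tune", "download", repo_id, "--output-dir", output_dir]
    if hf_token is not None:
        # Explicit token: pass it on the command line.
        cmd += ["--hf-token", hf_token]
    elif not env.get("HF_TOKEN"):
        # No explicit token and nothing in the environment: stop early.
        raise ValueError("Set the HF_TOKEN environment variable or pass hf_token")
    return cmd


cmd = build_download_command(
    "meta-llama/Llama-2-7b-hf", "/tmp/Llama-2-7b-hf", hf_token="<HF_TOKEN>"
)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```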
Llama2 7B + LoRA on single GPU:
tune run lora_finetune_single_device --config llama2/7B_lora_single_device
For distributed training, tune CLI integrates with torchrun. Llama2 7B + full fine-tune on two GPUs:
tune run --nproc_per_node 2 full_finetune_distributed --config llama2/7B_full
Tip: Make sure to place any torchrun arguments before the recipe specification. Any CLI arguments after the recipe are treated as config overrides and do not affect distributed training.
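The ordering rule in the tip can be made explicit with a small helper that assembles the argument list. This is a sketch for scripting purposes; `build_run_command` is a hypothetical name and not part of the torchtune CLI.

```python
from typing import List, Sequence


def build_run_command(
    recipe: str,
    config: str,
    torchrun_args: Sequence[str] = (),
    overrides: Sequence[str] = (),
) -> List[str]:
    """Assemble a `tune run` command with arguments in the order the CLI expects."""
    return [
        "tune",
        "run",
        *torchrun_args,  # torchrun flags go before the recipe name
        recipe,
        "--config",
        config,
        *overrides,  # key=value pairs after the recipe override the config
    ]


cmd = build_run_command(
    "full_finetune_distributed",
    "llama2/7B_full",
    torchrun_args=["--nproc_per_node", "2"],
    overrides=["batch_size=8"],
)
print(" ".join(cmd))
```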
There are two ways in which you can modify configs:
Config Overrides
You can easily overwrite config properties from the command-line:
tune run lora_finetune_single_device \
--config llama2/7B_lora_single_device \
batch_size=8 \
enable_activation_checkpointing=True \
max_steps_per_epoch=128
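Conceptually, each key=value pair replaces the matching entry loaded from the YAML config. The sketch below mimics that behavior on a plain dictionary. It is an illustration only: `apply_overrides` is a hypothetical helper, the starting values are made up, and torchtune's actual parsing (for example of nested keys or value types) may differ.

```python
import ast
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Return a copy of config with key=value overrides applied on top."""
    merged = dict(config)
    for item in overrides:
        key, sep, raw = item.partition("=")
        if not sep or not key:
            raise ValueError(f"Expected key=value, got {item!r}")
        try:
            # Interpret numbers and booleans as Python literals (8, True, ...).
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything else (e.g. bf16) stays a plain string.
            value = raw
        merged[key] = value
    return merged


# Assumed starting values, for illustration only.
base = {"batch_size": 2, "dtype": "bf16"}
merged = apply_overrides(
    base,
    ["batch_size=8", "enable_activation_checkpointing=True", "max_steps_per_epoch=128"],
)
print(merged)
```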
Update a Local Copy
You can also copy the config to your local directory and modify the contents directly:
tune cp llama2/7B_full ./my_custom_config.yaml
Copied to ./my_custom_config.yaml
Then, you can run your custom recipe by directing the tune run command to your local files:
tune run full_finetune_distributed --config ./my_custom_config.yaml
Check out tune --help for all possible CLI commands and options. For more information on using and updating configs, take a look at our config deep-dive.
torchtune embodies PyTorch’s design philosophy [details], especially "usability over everything else".
torchtune is a native-PyTorch library. While we provide integrations with the surrounding ecosystem (e.g., Hugging Face Datasets, EleutherAI Eval Harness), all of the core functionality is written in PyTorch.
torchtune is designed to be easy to understand, use and extend.
torchtune provides well-tested components with a high bar for correctness. The library will never be the first to provide a feature, but available features will be thoroughly tested.
We really value our community and the contributions made by our wonderful users. We'll use this section to call out some of these contributions! If you'd like to help out as well, please see the CONTRIBUTING guide.
The Llama2 code in this repository is inspired by the original Llama2 code.
We want to give a huge shout-out to EleutherAI, Hugging Face and Weights & Biases for being wonderful collaborators and for working with us on some of these integrations within torchtune.
We also want to acknowledge some awesome libraries and tools from the ecosystem:
torchtune is released under the BSD 3 license. However, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models.