hiyouga / ChatGLM-Efficient-Tuning
- пятница, 30 июня 2023 г. в 00:00:03
Fine-tuning ChatGLM-6B with PEFT | 基于 PEFT 的高效 ChatGLM 微调
Fine-tuning
[ English | 中文 ]
[23/06/25] Now we align the demo API with the OpenAI's format where you can insert the fine-tuned model in arbitrary ChatGPT-based applications.
[23/06/25] Now we support fine-tuning the ChatGLM2-6B model with our framework! Try --use_v2
argument to fine-tune that model.
[23/06/05] Now we support 4-bit LoRA training (aka QLoRA). Try --quantization_bit 4
argument to work with 4-bit quantized model. (experimental feature)
[23/06/01] We implemented a framework supporting the efficient tuning of LLaMA and BLOOM models. Please follow LLaMA-Efficient-Tuning if you are interested.
[23/05/19] Now we support using the development set to evaluate the model while training. Try --dev_ratio
argument to specify the size of development set.
[23/04/29] Now we support training ChatGLM with Reinforcement Learning with Human Feedback (RLHF) ! We provide several examples to run RLHF training, please refer to the examples
folder for details.
[23/04/20] Our repo achieved 100 stars within 12 days! Congratulations!
[23/04/19] Now we support merging the weights of fine-tuned models trained by LoRA! Try --checkpoint_dir checkpoint1,checkpoint2
argument for continually fine-tuning the models.
[23/04/18] Now we support training the quantized models using three fine-tuning methods! Try quantization_bit
argument for training the model in 4/8 bits.
[23/04/12] Now we support training from checkpoints! Use --checkpoint_dir
argument to specify the checkpoint model to fine-tune from.
[23/04/11] Now we support training with combined datasets! Try --dataset dataset1,dataset2
argument for training with multiple datasets.
Our script now supports the following datasets:
Please refer to data/README.md for details.
Some datasets require confirmation before using them, so we recommend logging in with your HuggingFace account using these commands.
pip install --upgrade huggingface_hub
huggingface-cli login
Our script now supports the following fine-tuning methods:
And powerful GPUs!
Please refer to data/example_dataset
for checking the details about the format of dataset files. You can either use a single .json
file or a dataset loading script with multiple files to create a custom dataset.
Note: please update data/dataset_info.json
to use your custom dataset. About the format of this file, please refer to data/README.md
.
git clone https://github.com/hiyouga/ChatGLM-Efficient-Tuning.git
conda create -n chatglm_etuning python=3.10
conda activate chatglm_etuning
cd ChatGLM-Efficient-Tuning
pip install -r requirements.txt
If you want to enable LoRA or Freeze quantization on Windows, you will be required to install a pre-built version of bitsandbytes
library, which supports CUDA 11.6 or 11.7.
pip install https://github.com/acpopescu/bitsandbytes/releases/download/v0.37.2-win.1/bitsandbytes-0.37.2-py3-none-any.whl
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \
--do_train \
--dataset alpaca_gpt4_en \
--finetuning_type lora \
--output_dir path_to_sft_checkpoint \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 5e-5 \
--num_train_epochs 3.0 \
--fp16
Please refer to our Wiki about the details of the arguments.
accelerate config # configure the environment
accelerate launch src/train_sft.py # arguments (same as above)
Note: if you are using LoRA method at fine-tuning, please provide --ddp_find_unused_parameters False
argument to avoid the runtime error.
CUDA_VISIBLE_DEVICES=0 python src/train_rm.py \
--do_train \
--dataset comparison_gpt4_en \
--finetuning_type lora \
--output_dir path_to_rm_checkpoint \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 1e-5 \
--num_train_epochs 1.0 \
--fp16
CUDA_VISIBLE_DEVICES=0 python src/train_ppo.py \
--do_train \
--dataset alpaca_gpt4_en \
--finetuning_type lora \
--checkpoint_dir path_to_sft_checkpoint \
--reward_model path_to_rm_checkpoint \
--output_dir path_to_ppo_checkpoint \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 1e-5 \
--num_train_epochs 1.0 \
--fp16
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \
--do_eval \
--dataset alpaca_gpt4_en \
--checkpoint_dir path_to_checkpoint \
--output_dir path_to_eval_result \
--per_device_eval_batch_size 8 \
--max_samples 50 \
--predict_with_generate
CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \
--do_predict \
--dataset alpaca_gpt4_en \
--checkpoint_dir path_to_checkpoint \
--output_dir path_to_predict_result \
--per_device_eval_batch_size 8 \
--max_samples 50 \
--predict_with_generate
python src/cli_demo.py \
--checkpoint_dir path_to_checkpoint
python src/web_demo.py \
--checkpoint_dir path_to_checkpoint
python src/export_model.py \
--checkpoint_dir path_to_checkpoint \
--output_dir path_to_export
Fine-tune method | Batch size | Mode | GRAM | Speed |
---|---|---|---|---|
LoRA (r=8) | 16 | FP16 | 28GB | 8ex/s |
LoRA (r=8) | 8 | FP16 | 24GB | 8ex/s |
LoRA (r=8) | 4 | FP16 | 20GB | 8ex/s |
LoRA (r=8) | 4 | INT8 | 10GB | 8ex/s |
LoRA (r=8) | 4 | INT4 | 8GB | 8ex/s |
P-Tuning (p=16) | 4 | FP16 | 20GB | 8ex/s |
P-Tuning (p=16) | 4 | INT8 | 16GB | 8ex/s |
P-Tuning (p=16) | 4 | INT4 | 12GB | 8ex/s |
Freeze (l=3) | 4 | FP16 | 24GB | 8ex/s |
Freeze (l=3) | 4 | INT8 | 12GB | 8ex/s |
RM method | Batch size | Mode | GRAM | Speed |
---|---|---|---|---|
LoRA (r=8) + rm | 4 | FP16 | 22GB | - |
LoRA (r=8) + rm | 1 | INT8 | 11GB | - |
RLHF method | Batch size | Mode | GRAM | Speed |
---|---|---|---|---|
LoRA (r=8) + ppo | 4 | FP16 | 23GB | - |
LoRA (r=8) + ppo | 1 | INT8 | 12GB | - |
Note:
r
is the lora rank,p
is the number of prefix tokens,l
is the number of trainable layers,ex/s
is the examples per second at training. Thegradient_accumulation_steps
is set to1
. All are evaluated on a single Tesla V100 (32G) GPU, they are approximated values and may vary in different GPUs.
We use the whole alpaca_gpt4_zh
dataset to fine-tune the ChatGLM model with LoRA (r=8) for one epoch, using the default hyper-parameters. The loss curve during training is presented below.
We select 100 instances in the alpaca_gpt4_zh
dataset to evaluate the fine-tuned ChatGLM model and compute the BLEU and ROUGE scores. The results are presented below.
Score | Original | FZ (l=2) | PT (p=16) | LoRA (r=8) |
---|---|---|---|---|
BLEU-4 | 15.75 | 16.85 | 16.06 | 17.01 (+1.26) |
Rouge-1 | 34.51 | 36.62 | 34.80 | 36.77 (+2.26) |
Rouge-2 | 15.11 | 17.04 | 15.32 | 16.83 (+1.72) |
Rouge-l | 26.18 | 28.17 | 26.35 | 28.86 (+2.68) |
Params (%) | / | 4.35% | 0.06% | 0.06% |
FZ: freeze tuning, PT: P-Tuning V2 (we use
pre_seq_len=16
for fair comparison with LoRA), Params: the percentange of trainable parameters.
This repository is licensed under the Apache-2.0 License. Please follow the Model License to use ChatGLM-6B model.
If this work is helpful, please cite as:
@Misc{chatglm-efficient-tuning,
title = {ChatGLM Efficient Tuning},
author = {hiyouga},
howpublished = {\url{https://github.com/hiyouga/ChatGLM-Efficient-Tuning}},
year = {2023}
}
This repo benefits from ChatGLM-6B, ChatGLM-Tuning and yuanzhoulvpi2017/zero_nlp. Thanks for their wonderful works.