PhoebusSi / Alpaca-CoT
- Thursday, March 30, 2023, 00:14:30
We extend CoT data to Alpaca to boost its reasoning ability. We are constantly expanding our collection of instruction-tuning data. The instruction collection can be found at https://huggingface.co/datasets/QingyiSi/Alpaca-CoT/tree/main
A Chinese README can be found here.
This is the repository for the Evolving Alpaca project, which aims to extensively collect instruction-tuning datasets (especially CoT datasets) and conduct an in-depth empirical study based on the LLaMA model [1]. "Evolving" describes the continuous expansion of our instruction-tuning data collection, which will continuously enhance Alpaca's [2] instruction-following capabilities.
You are warmly welcome to provide us with any instruction-tuning datasets (or their sources) that we have not yet collected. We will format them uniformly, train the Alpaca model (and other LLMs in the near future) on them, open-source the model checkpoints, and conduct extensive empirical studies. We hope that our project can make a modest contribution to the open-sourcing of large language models and lower the barrier to entry for NLP researchers.
LLaMA [1] is a great work that demonstrates strong zero-shot and few-shot ability. It significantly reduces the cost of training, finetuning, and using competitive large language models; e.g., LLaMA-13B outperforms GPT-3 (175B) on most benchmarks and LLaMA-65B is competitive with PaLM-540B. Recently, to boost the instruction-following ability of LLaMA, Stanford Alpaca [2] finetuned LLaMA-7B on 52K instruction-following samples generated with the Self-Instruct [3] technique. However, the LLM research community still faces two challenges: 1. Even LLaMA still has high requirements for computing resources, and 2. There are not many open-source datasets for instruction finetuning.
To this end, we propose this project, which leverages various improvements that were subsequently proposed, with the following advantages:
The 7b, 13b and 30b versions of LLaMA models can be easily trained on a single 80G A100.
To the best of our knowledge, this work is the first to study CoT reasoning based on LLaMA and Alpaca. Therefore, we abbreviate our work to Alpaca-CoT.
[1]: LLaMA: Open and Efficient Foundation Language Models
[2]: Stanford Alpaca: An Instruction-following LLaMA model
[3]: Self-Instruct: Aligning Language Model with Self Generated Instructions
[4]: LoRA: Low-Rank Adaptation of Large Language Models
[5]: FLAN: Scaling Instruction-Finetuned Language Models
[6]: BELLE: Bloom-Enhanced Large Language model Engine
The current collection of instruction-finetuning datasets consists mainly of three parts:
alpaca_data_cleaned.json: about 52K English instruction-following training samples.
belle_data_cn.json: about 0.5M Chinese instruction-following training samples.
CoT_data.json: 9 CoT datasets involving about 75K samples.
More details on the usage and sources of the different datasets can be found here.
You can download all the formatted data here. Then you should put them in the data folder.
You can download all checkpoints trained on various types of instruction data from here. Then, after setting LoRA_Weights (in generate.py) to the local path, you can directly run model inference.
All data in our collection is formatted with the same template, where each sample is as follows:
[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
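As a rough illustration of this format, the following sketch checks that every sample carries the three string fields. It assumes each data file is a plain JSON array; the helper name `validate_samples` and the two example samples are hypothetical and not part of the repository.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def validate_samples(samples):
    """Return the samples unchanged if each one has the three string fields."""
    for index, sample in enumerate(samples):
        for key in REQUIRED_KEYS:
            if not isinstance(sample.get(key), str):
                raise ValueError(
                    f"sample {index}: field '{key}' is missing or not a string"
                )
    return samples


# Hypothetical file content; note that "input" may be an empty string.
raw = json.dumps([
    {"instruction": "Add the two numbers.", "input": "2, 3", "output": "5"},
    {"instruction": "Name a primary color.", "input": "", "output": "Red"},
])

samples = validate_samples(json.loads(raw))
print(len(samples))
```

A check like this can be run once over each downloaded file before training, so that a malformed sample is caught early rather than in the middle of a finetuning run.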
Note that, for CoT datasets, we first use the template provided by FLAN to change the original dataset into various Chain-of-Thoughts forms, and then convert it to the above format. The formatting script can be found here.
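The conversion described here can be pictured with a small sketch. It is only an illustration under stated assumptions: the template strings below are made up in the spirit of FLAN-style CoT templates (they are not the actual FLAN templates), and `cot_to_sample` is a hypothetical name, not the repository's formatting script.

```python
# Hypothetical (instruction, output) template pairs; not the real FLAN templates.
COT_TEMPLATES = [
    ("{question} Let's think step by step.",
     "{rationale} So the answer is {answer}."),
    ("Answer the following question with step-by-step reasoning. {question}",
     "{rationale} The answer: {answer}."),
]


def cot_to_sample(question, rationale, answer, template_id=0):
    """Render one CoT record into the instruction/input/output format."""
    instruction_tpl, output_tpl = COT_TEMPLATES[template_id % len(COT_TEMPLATES)]
    return {
        "instruction": instruction_tpl.format(question=question),
        "input": "",  # the question is folded into the instruction in this sketch
        "output": output_tpl.format(rationale=rationale, answer=answer),
    }


sample = cot_to_sample(
    question="Tom has 3 apples and buys 2 more. How many apples does he have?",
    rationale="He starts with 3 apples and adds 2, and 3 + 2 = 5.",
    answer="5",
)
print(sample["output"])
```

Cycling `template_id` over the records is one simple way to obtain the "various Chain-of-Thoughts forms" mentioned above; whether the actual script does this is not stated here.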
pip install -r requirements.txt
Single GPU
## --data
# alpaca-cot: reasoning-enhanced version
# alpaca-belle: Chinese-enhanced version
# alpaca-belle-cot: full-data version
## --size
# [7, 13, 30, 65]
python3 finetune.py --size 7 --data alpaca-belle-cot
Multiple GPUs
## --data
# alpaca-cot: reasoning-enhanced version
# alpaca-belle: Chinese-enhanced version
# alpaca-belle-cot: full-data version
## --size
# [7, 13, 30, 65]
python3 -m torch.distributed.launch --nproc_per_node 4 \
--nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy finetune.py --size 7 --data alpaca-belle-cot
Inference
## --data
# alpaca-cot: reasoning-enhanced version
# alpaca-belle: Chinese-enhanced version
# alpaca-belle-cot: full-data version
## --size
# [7, 13, 30, 65]
python3 generate.py --size 7 --data alpaca-belle-cot
More details on instruction finetuning and inference can be found here, in the repository our code is modified from. Note that the folders saved-xxx7b are the save paths for the LoRA weights, and the LLaMA weights are downloaded automatically from Hugging Face.
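To make the `--data` options more concrete, here is a sketch of how an option could be resolved into training samples. The mapping is an assumption inferred from the option names and the three data files listed earlier; the actual finetune.py may instead read pre-merged files, and `load_training_data` is a hypothetical helper.

```python
import json
import os
import tempfile

# Assumed mapping from --data values to the component files named in this README.
DATA_PARTS = {
    "alpaca-cot": ["alpaca_data_cleaned.json", "CoT_data.json"],
    "alpaca-belle": ["alpaca_data_cleaned.json", "belle_data_cn.json"],
    "alpaca-belle-cot": [
        "alpaca_data_cleaned.json",
        "belle_data_cn.json",
        "CoT_data.json",
    ],
}


def load_training_data(data_option, data_dir="data"):
    """Concatenate the JSON sample lists that make up one --data option."""
    if data_option not in DATA_PARTS:
        raise ValueError(f"unknown --data option: {data_option}")
    merged = []
    for filename in DATA_PARTS[data_option]:
        with open(os.path.join(data_dir, filename), encoding="utf-8") as handle:
            merged.extend(json.load(handle))
    return merged


# Demo with tiny stand-in files, so the sketch runs without the real data.
demo_dir = tempfile.mkdtemp()
for name in ("alpaca_data_cleaned.json", "belle_data_cn.json", "CoT_data.json"):
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
        json.dump([{"instruction": f"from {name}", "input": "", "output": ""}], handle)

print(len(load_training_data("alpaca-belle-cot", demo_dir)))
```

Keeping the option-to-files mapping in one dictionary makes it easy to add a new variant when another instruction dataset joins the collection.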
"w/o CoT" and "w/o CN" denote models that exclude CoT data and Chinese instructions from their instruction finetuning data, respectively.
The above table shows two examples (involving numerical calculations) that require a certain amount of reasoning ability to answer correctly.
As shown in the middle column, Ours w/o CoT fails to generate the correct response, which shows that when the finetuning data contains no CoT data, the model's reasoning ability decreases significantly. This further demonstrates that CoT data is essential for LLMs.
The above table shows two examples that require the ability to respond to Chinese instructions.
As shown in the right column, Ours w/o CN either generates unreasonable content or answers the Chinese instructions in English. This shows that removing Chinese data during finetuning leaves the model unable to handle Chinese instructions, and further demonstrates the need to collect Chinese instruction-finetuning data.
The above table shows a relatively difficult example, which requires both knowledge of Chinese history and the ability to recount historical events logically and completely. As shown in this table, Ours w/o CN can only generate a short and erroneous response: without Chinese finetuning data, it lacks the corresponding knowledge of Chinese history. Although Ours w/o CoT lists some relevant Chinese historical events, its account is self-contradictory, which is caused by the lack of CoT data.
In summary, finetuning on our complete dataset (English, Chinese, and CoT instruction data) significantly improves the model's reasoning ability and its ability to follow Chinese instructions.
The samples in odd-numbered rows do not apply the CoT prompt, such as "step-by-step reasoning." Both Ours(w/CoT) and Alpaca are based on LLaMA-7B, and the only difference between the two is that the instruction-finetuning data of Ours(w/CoT) contains additional CoT data compared with that of Alpaca.
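Applying the CoT prompt amounts to concatenating a reasoning cue with the input question. A minimal sketch follows; the exact cue wording and the function name `build_question` are assumptions for illustration, not the prompt used to produce the table.

```python
# Hypothetical wording of the cue; the source only says "step-by-step".
COT_CUE = "Let's think step by step."


def build_question(question, use_cot_prompt):
    """Return the question as-is, or with the CoT cue appended."""
    question = question.strip()
    if use_cot_prompt:
        return f"{question} {COT_CUE}"
    return question


plain = build_question("Can a goldfish climb a tree?", use_cot_prompt=False)
with_cot = build_question("Can a goldfish climb a tree?", use_cot_prompt=True)
print(plain)
print(with_cot)
```

Comparing a model's responses to `plain` and `with_cot` for the same question is the kind of paired comparison the rows of the table describe.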
From the above table, we find that:
Ours(w/CoT) always generates the correct rationale before the answer, while Alpaca fails to generate any reasonable rationale, as shown in the first four examples (commonsense questions). This shows that using CoT data for finetuning can significantly improve reasoning ability.
For Ours(w/CoT), the CoT prompt (e.g., concatenating 'step-by-step' with the input question) has little effect on easy examples (e.g., commonsense questions) and an important effect on challenging questions (e.g., questions requiring reasoning, like the last four examples).
Quantitative comparison of responses to Chinese instructions.
Our model is finetuned from a 7B LLaMA on 52K English instructions and 0.5M Chinese instructions. Stanford Alpaca (our reimplementation) is finetuned from a 7B LLaMA on 52K English instructions. BELLE is finetuned from a 7B BLOOM on 2M Chinese instructions.
From the above table, we make several observations:
ours (w/ CN) has a stronger ability to understand Chinese instructions. In the first example, Alpaca fails to distinguish between the instruction part and the input part, while ours does.
ours (w/ CN) not only provides the correct code but also provides the corresponding Chinese annotation, while Alpaca does not. In addition, as shown in examples 3 to 5, Alpaca can only respond to Chinese instructions in English.
The performance of ours (w/ CN) on instructions requiring an open-ended response (as shown in the last two examples) still needs to be improved. BELLE's outstanding performance on such instructions is due to: 1. Its BLOOM backbone model encounters much more multilingual data during pre-training; 2. Its Chinese instruction-finetuning data is larger than ours, that is, 2M vs 0.5M.
Quantitative comparison of responses to English instructions. The purpose of this subsection is to explore whether finetuning on Chinese instructions has a negative impact on Alpaca.
From the above table, we find that:
ours (w/ CN) gives more detailed responses than Alpaca; e.g., in the third example, ours (w/ CN) lists three more provinces than Alpaca.
Please cite the repo if you use the data collection, code, or experimental findings in this repo.
@misc{alpaca-cot,
author = {Qingyi Si and Zheng Lin},
school = {Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China},
title = {Evolving Alpaca: An Empirical Study on Instruction Tuning for Large Language Models},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/PhoebusSi/alpaca-CoT}},
}
For data, please cite the original Stanford Alpaca, BELLE and FLAN papers as well.
For models, please cite the original LLaMA, Stanford Alpaca, Self-Instruct and LoRA papers as well.