OpenBMB / ToolBench
An open platform for training, serving, and evaluating large language models for tool learning.
Model • Data Release • Web Demo • Tool Eval • Paper • Citation
[2023/8/4] We provide a RapidAPI backend service to free you from using your own RapidAPI key and subscribing to the APIs. Please fill out our form. We will review it as soon as possible and send you the ToolBench key to get started!
[2023/8/1] Our paper is released.
[2023/7/27] A new version of ToolBench is released.
We also provide a demo of using ToolLLaMA.
Currently, our ToolLLaMA has reached the performance of ChatGPT (turbo-16k) in tool use. In the future, we will continually improve the data quality and increase the coverage of real-world tools.
Here is the Old version of ToolBench.
Tools | APIs | Instances | Real API Calls | Avg. Reasoning Traces |
---|---|---|---|---|
3451 | 16464 | 12657 | 37204 | 4.1 |
We crawl 16000+ real-world APIs from RapidAPI, and curate realistic human instructions that involve them. Below we present a hierarchy of RapidAPI and our instruction generation process.
ToolBench contains both single-tool and multi-tool scenarios. The multi-tool scenarios can be further categorized into intra-category multi-tool and intra-collection multi-tool. We use the DFSDT method for data creation in all scenarios. Here is an illustration of the data-creation process with DFSDT:
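For intuition only, below is a minimal Python sketch of how a DFSDT-style depth-first search over candidate tool calls can be organized: at each node the model proposes a few candidate actions, each tool call is executed to obtain an observation, and the search backtracks when a branch is pruned or exhausted. The callables propose_actions, execute, and is_final_answer are hypothetical placeholders, not the actual ToolBench code.

from typing import Callable, List, Optional

# A minimal, conceptual sketch of a DFSDT-style search (depth-first search over a
# decision tree of tool calls). The three callables are hypothetical placeholders
# for the model and the tool environment; this is NOT the ToolBench implementation.

def dfsdt(
    state: list,
    propose_actions: Callable[[list, int], List[dict]],  # model proposes candidate next actions
    execute: Callable[[dict], str],                       # runs a tool call, returns an observation
    is_final_answer: Callable[[dict], bool],              # does this action finish the instruction?
    depth: int = 0,
    max_depth: int = 5,
    width: int = 2,
) -> Optional[dict]:
    """Return the first action that finishes the task, or None if this subtree fails."""
    if depth > max_depth:
        return None  # prune this branch and backtrack

    for action in propose_actions(state, width):
        if is_final_answer(action):
            return action  # leaf: the model decided it can answer

        observation = execute(action)  # call the (real or simulated) API
        result = dfsdt(state + [action, observation], propose_actions, execute,
                       is_final_answer, depth + 1, max_depth, width)
        if result is not None:
            return result  # first successful branch wins

    return None  # every candidate at this node failed; backtrack to the parent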
Please download our dataset using the following link: Google Drive or Tsinghua Cloud.
The G1, G2, and G3 data refer to single-tool, intra-category multi-tool, and intra-collection multi-tool data respectively. We also have an Atlas Explorer for visualization.
toolllama_G123_dfs_train.json refers to the combined training data.
The tool environment is in the toolenv directory.
The test_query_ids directory contains the query ids of the test instances in each test set.
The retrieval data is in the retrieval directory.
We release the ToolLLaMA-7b and ToolLLaMA-7b-LoRA models, which are both trained on the released dataset in a multi-task fashion. We also release the tool retriever trained under our experimental setting.
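As a quick sanity check after downloading and unzipping the data, you can load the combined training file and inspect its size and top-level fields. This is only a sketch; it assumes data.zip was extracted into ./data and that the file is a JSON list, and it makes no claim about the exact record schema.

import json

# Sanity-check sketch for the downloaded dataset (assumes ./data from data.zip
# and that toolllama_G123_dfs_train.json is a JSON list of training records).
with open("data/toolllama_G123_dfs_train.json", encoding="utf-8") as f:
    train_data = json.load(f)

print(f"{len(train_data)} training examples")

first = train_data[0]
if isinstance(first, dict):
    # Peek at the top-level fields of one record; the schema is not documented here.
    print(sorted(first.keys()))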
Clone this repository and navigate to the ToolBench folder.
git clone git@github.com:OpenBMB/ToolBench.git
cd ToolBench
Install the packages (Python >= 3.9):
pip install -r requirements.txt
or, for ToolEval only:
pip install -r toolbench/tooleval/requirements.txt
Prepare the data and tool environment:
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1Vis-RxBstXLKC1W1agIQUJNuumPJrrw0&confirm=yes' -O data.zip
unzip data.zip
To train the API retriever, first preprocess the retrieval data:
export PYTHONPATH=./
python data/preprocess_retriever_data.py \
--query_file data/instruction/G1_query.json \
--index_file data/test_query_ids/G1_instruction_test_query_ids.json \
--dataset_name G1 \
--output_dir data/retrieval/G1
Then train the retriever:
export PYTHONPATH=./
python toolbench/retrieval/train.py \
--data_path data/retrieval/G1/ \
--model_name bert-base-uncased \
--output_path retrieval_model \
--num_epochs 5 \
--train_batch_size 32 \
--learning_rate 2e-5 \
--warmup_steps 500 \
--max_seq_length 256
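Once training finishes, the retriever can be used to rank APIs for a query. The sketch below assumes the checkpoint in retrieval_model loads as a Sentence-Transformers model and that data/retrieval/G1/corpus.tsv is a tab-separated file whose last column is the API document text; both are assumptions rather than guarantees of this README.

import csv

from sentence_transformers import SentenceTransformer, util

# Sketch: rank APIs for a query with the trained dense retriever.
# Assumes ./retrieval_model is loadable by SentenceTransformer and that
# corpus.tsv is tab-separated with the document text in the last column.
model = SentenceTransformer("retrieval_model")

with open("data/retrieval/G1/corpus.tsv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))
docs = [row[-1] for row in rows[1:]]  # skip a possible header row

query = "Check today's weather in New York and find a nearby Italian restaurant."
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Top-5 most similar API documents by cosine similarity.
for hit in util.semantic_search(query_emb, doc_embs, top_k=5)[0]:
    print(f"score={hit['score']:.3f}  {docs[hit['corpus_id']][:80]}")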
Our training code is based on FastChat. You can use the following command to train ToolLLaMA-7b on 2 x A100 (80GB) GPUs, using the preprocessed data from our data link:
export PYTHONPATH=./
torchrun --nproc_per_node=2 --master_port=20001 toolbench/train/train_long_seq.py \
--model_name_or_path huggyllama/llama-7b \
--data_path data/toolllama_G123_dfs_train.json \
--eval_data_path data/toolllama_G123_dfs_eval.json \
--conv_template tool-llama-single-round \
--bf16 True \
--output_dir toolllama \
--num_train_epochs 2 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "epoch" \
--prediction_loss_only \
--save_strategy "epoch" \
--save_total_limit 8 \
--learning_rate 5e-5 \
--weight_decay 0. \
--warmup_ratio 0.04 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
--tf32 True \
--model_max_length 8192 \
--gradient_checkpointing True \
--lazy_preprocess True \
--report_to none
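For reference, the effective global batch size implied by the command above is the product of the number of processes, the per-device batch size, and the gradient accumulation steps:

# Effective global batch size for the full fine-tuning command above.
nproc_per_node = 2                 # torchrun --nproc_per_node
per_device_train_batch_size = 2    # --per_device_train_batch_size
gradient_accumulation_steps = 8    # --gradient_accumulation_steps

print(nproc_per_node * per_device_train_batch_size * gradient_accumulation_steps)  # 32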
You can also preprocess and split the data in your own way with this command:
export PYTHONPATH=./
python preprocess/preprocess_toolllama_data.py \
--tool_data_dir data/answer/G1_answer \
--method DFS_woFilter_w2 \
--output_file data/answer/toolllama_G1_dfs.json
To train the LoRA version:
export PYTHONPATH=./
deepspeed --master_port=20001 toolbench/train/train_long_seq_lora.py \
--model_name_or_path huggyllama/llama-7b \
--data_path data/toolllama_G123_dfs_train.json \
--eval_data_path data/toolllama_G123_dfs_eval.json \
--conv_template tool-llama-single-round \
--bf16 True \
--output_dir toolllama_lora \
--num_train_epochs 5 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 2 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "epoch" \
--prediction_loss_only \
--save_strategy "epoch" \
--save_total_limit 8 \
--learning_rate 5e-5 \
--weight_decay 0. \
--warmup_ratio 0.04 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--model_max_length 8192 \
--gradient_checkpointing True \
--lazy_preprocess True \
--deepspeed ds_configs/stage2.json \
--report_to none
Please fill out the form first; after reviewing it, we will send you the ToolBench key. Then set your ToolBench key:
export TOOLBENCH_KEY="your_toolbench_key"
To run inference with ToolLLaMA, run the following commands:
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
--tool_root_dir data/toolenv/tools/ \
--backbone_model toolllama \
--model_path ToolBench/ToolLLaMA-7b \
--max_observation_length 1024 \
--observ_compress_method truncate \
--method DFS_woFilter_w2 \
--input_query_file data/instruction/inference_query_demo.json \
--output_answer_file data/answer/toolllama_dfs \
--toolbench_key $TOOLBENCH_KEY
For ToolLLaMA-LoRA:
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
--tool_root_dir data/toolenv/tools/ \
--backbone_model toolllama \
--model_path huggyllama/llama-7b \
--lora \
--lora_path /path/to/your/downloaded/ToolLLaMA-7b-LoRA \
--max_observation_length 1024 \
--observ_compress_method truncate \
--method DFS_woFilter_w2 \
--input_query_file data/instruction/inference_query_demo.json \
--output_answer_file data/answer/toolllama_lora_dfs \
--toolbench_key $TOOLBENCH_KEY
For ToolLLaMA-LoRA under the open-domain setting, run:
export PYTHONPATH=./
python toolbench/inference/qa_pipeline_open_domain.py \
--tool_root_dir data/toolenv/tools/ \
--corpus_tsv_path data/retrieval/G1/corpus.tsv \
--retrieval_model_path /path/to/your/retrieval_model \
--retrieved_api_nums 5 \
--backbone_model toolllama \
--model_path huggyllama/llama-7b \
--lora \
--lora_path /path/to/your/toolllama_lora \
--max_observation_length 1024 \
--observ_compress_method truncate \
--method DFS_woFilter_w2 \
--input_query_file data/instruction/inference_query_demo_open_domain.json \
--output_answer_file data/answer/toolllama_lora_dfs_open_domain \
--toolbench_key $TOOLBENCH_KEY
To use ChatGPT, run:
export TOOLBENCH_KEY=""
export OPENAI_KEY=""
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
--tool_root_dir data/toolenv/tools/ \
--backbone_model chatgpt_function \
--openai_key $OPENAI_KEY \
--max_observation_length 1024 \
--method DFS_woFilter_w2 \
--input_query_file data/instruction/inference_query_demo.json \
--output_answer_file data/answer/chatgpt_dfs \
--toolbench_key $TOOLBENCH_KEY
To use Text-Davinci-003, run:
export TOOLBENCH_KEY=""
export OPENAI_KEY=""
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
--tool_root_dir data/toolenv/tools/ \
--backbone_model davinci \
--openai_key $OPENAI_KEY \
--max_observation_length 1024 \
--method DFS_woFilter_w2 \
--input_query_file data/instruction/inference_query_demo.json \
--output_answer_file data/answer/davinci_dfs \
--toolbench_key $TOOLBENCH_KEY
To run inference with a customized RapidAPI account, pass your RapidAPI key through the rapidapi_key argument and specify the use_rapidapi_key argument in the script:
export RAPIDAPI_KEY=""
export OPENAI_KEY=""
export PYTHONPATH=./
python toolbench/inference/qa_pipeline.py \
--tool_root_dir data/toolenv/tools/ \
--backbone_model chatgpt_function \
--openai_key $OPENAI_KEY \
--max_observation_length 1024 \
--method DFS_woFilter_w2 \
--input_query_file data/instruction/inference_query_demo.json \
--output_answer_file data/answer/chatgpt_dfs \
--rapidapi_key $RAPIDAPI_KEY \
--use_rapidapi_key
ToolBench contains a Web UI based on Chatbot UI, forked to include the use of tools in the interface. It comes in two parts: the backend server and chatbot-ui-toolllama. Here is a video demo.
git clone https://github.com/lilbillybiscuit/chatbot-ui-toolllama
cd chatbot-ui-toolllama
npm install
npm run dev
The app will be available on http://localhost:3000/
export PYTHONPATH=./
python toolbench/inference/toolbench_server.py \
--tool_root_dir data/toolenv/tools/ \
--corpus_tsv_path data/retrieval/G1/corpus.tsv \
--retrieval_model_path /path/to/your/retrieval_model \
--retrieved_api_nums 5 \
--backbone_model toolllama \
--model_path huggyllama/llama-7b \
--lora \
--lora_path /path/to/your/toolllama_lora \
--max_observation_length 1024 \
--method DFS_woFilter_w2 \
--input_query_file data/instruction/inference_query_demo_open_domain.json \
--output_answer_file data/answer/toolllama_lora_dfs_open_domain \
--rapidapi_key $RAPIDAPI_KEY
This server will be available at http://localhost:5000/. To start a request, call http://localhost:5000/stream with a GET or POST request containing a JSON object with the following fields:
{
"text": "What is the weather in New York today?",
"top_k": 5,
"method": "DFS_woFilter_w2"
}
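For example, a minimal Python client for this endpoint could look like the sketch below. The request fields come straight from the JSON above; how the server formats its response (a single body vs. streamed chunks) is an assumption, so the script simply prints whatever comes back.

import requests

# Minimal client sketch for the ToolBench server started above.
payload = {
    "text": "What is the weather in New York today?",
    "top_k": 5,
    "method": "DFS_woFilter_w2",
}

resp = requests.post("http://localhost:5000/stream", json=payload, stream=True, timeout=600)
resp.raise_for_status()

# Print whatever the server streams back, chunk by chunk.
for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
    if chunk:
        print(chunk, end="")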
By fine-tuning LLaMA on ToolBench, we obtain ToolLLaMA. Considering that human evaluation can be time-consuming, we follow AlpacaEval to develop an efficient machine evaluator, ToolEval, which incorporates two evaluation metrics: Pass Rate, the proportion of instructions successfully completed within a limited number of API calls, and Preference (reported as Win Rate), which compares two answers to the same instruction and measures how often the candidate model's answer is preferred over a reference answer.
To validate the effectiveness of the Preference metric, we sample answer pairs from three different methods (ChatGPT+ReACT, GPT4+ReACT, and ChatGPT+DFSDT) for 600 test instructions, and ask human annotators to label their preference (4 annotations per answer pair, 2,400 annotations in total). Our automatic evaluator, developed using ChatGPT, demonstrates a significant correlation of 75.8% with the human annotators. The agreement among different human annotators is 83.54%, and the agreement between humans and our evaluator is 80.21%.
More details about ToolEval can be found in our paper.
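To make the Preference metric concrete: the win rate of a candidate model is the fraction of pairwise comparisons in which the evaluator prefers its answer over the reference model's answer. The tiny illustration below is not ToolEval code; in particular, counting a tie as half a win is just one possible convention.

# Illustrative only: turning pairwise preference labels into a win rate.
# Each label records which answer the evaluator preferred for one instruction:
# "test" (candidate model), "ref" (reference model), or "tie".
labels = ["test", "ref", "test", "tie", "test", "ref"]

wins = sum(1.0 for x in labels if x == "test") + 0.5 * sum(1 for x in labels if x == "tie")
print(f"win rate: {100.0 * wins / len(labels):.1f}%")  # 58.3% for this toy example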
To evaluate a model on the G1-Instruction test set, for example, run the following commands.
First, compute the pass rate:
python toolbench/tooleval/pass_rate.py --answer_dir data/answer/toolllama_dfs/G1_instruction
Then compute the preference (win rate) against a reference model, here ChatGPT+CoT:
export OPENAI_KEY=""
export REF_MODEL_DATA="data/answer/chatgpt_cot/G1_instruction"
export REF_MODEL_METHOD="CoT"
export TEST_MODEL_DATA="data/answer/toolllama_dfs/G1_instruction"
export TEST_MODEL_METHOD="DFS"
python ./toolbench/tooleval/convert_to_answer_format.py \
--method CoT \
--answer_dir $REF_MODEL_DATA \
--output ${REF_MODEL_DATA}_converted
python ./toolbench/tooleval/convert_to_answer_format.py \
--method DFS \
--answer_dir $TEST_MODEL_DATA \
--output ${TEST_MODEL_DATA}_converted
python ./toolbench/tooleval/automatic_eval_sample.py \
--output ${REF_MODEL_DATA}_converted \
--ref_output ${TEST_MODEL_DATA}_converted \
--method $REF_MODEL_METHOD \
--use_existed_output
In our main experiments, ToolLLaMA demonstrates a compelling capability to handle both single-tool and complex multi-tool instructions. Below are the main results compared with ChatGPT and Text-Davinci-003.
Pass Rate:
model | I1-Inst. | I1-Tool. | I1-Cat. | I2-Inst. | I2-Cat. | I3-Inst. | Average |
---|---|---|---|---|---|---|---|
ChatGPT-DFSDT | 78 | 84 | 89 | 51 | 58 | 57 | 69.6 |
ChatGPT-ReACT | 56 | 62 | 66 | 28 | 22 | 30 | 44.0 |
Text-Davinci-003-DFSDT | 53 | 58 | 61 | 38 | 38 | 39 | 47.8 |
Text-Davinci-003-ReACT | 19 | 25 | 30 | 12 | 11 | 14 | 18.5 |
ToolLLaMA | 68 | 80 | 75 | 47 | 56 | 40 | 61.0 |
ToolLLaMA-LoRA | 51 | 63 | 61 | 38 | 42 | 45 | 50.0 |
ToolLLaMA-API Retriever | 62 | 62 | 72 | 45 | 55 | 47 | 57.2 |
Win Rate: (Reference model: ChatGPT-DFSDT)
model | I1-Inst. | I1-Tool. | I1-Cat. | I2-Inst. | I2-Cat. | I3-Inst. | Average |
---|---|---|---|---|---|---|---|
ChatGPT-DFSDT | 50 | 50 | 50 | 50 | 50 | 50 | 50.0 |
ChatGPT-ReACT | 38 | 32 | 41 | 43 | 22 | 23 | 30.7 |
Text-Davinci-003-ReACT | 14 | 21 | 18 | 8 | 7 | 12 | 13.3 |
Text-Davinci-003-DFSDT | 38 | 34 | 43 | 25 | 20 | 28 | 31.3 |
ToolLLaMA | 50 | 45 | 45 | 59 | 48 | 46 | 48.8 |
ToolLLaMA-LoRA | 43 | 36.4 | 30 | 42 | 45 | 51 | 41.2 |
ToolLLaMA-API Retriever | 51 | 39 | 44 | 49 | 49 | 55 | 47.8 |
With the powerful capabilities of foundation models, we are eager to see their applications in manipulating various tools. For more resources, please refer to the following:
BMTools. [Project]
Tool Learning Survey. [Paper]
Tool Learning Paper List. [Project]
WebCPM. [Paper]
Feel free to cite us if you like ToolBench.
@misc{qin2023toolllm,
title={ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs},
author={Yujia Qin and Shihao Liang and Yining Ye and Kunlun Zhu and Lan Yan and Yaxi Lu and Yankai Lin and Xin Cong and Xiangru Tang and Bill Qian and Sihan Zhao and Runchu Tian and Ruobing Xie and Jie Zhou and Mark Gerstein and Dahai Li and Zhiyuan Liu and Maosong Sun},
year={2023},
eprint={2307.16789},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
@misc{qin2023tool,
title={Tool Learning with Foundation Models},
author={Yujia Qin and Shengding Hu and Yankai Lin and Weize Chen and Ning Ding and Ganqu Cui and Zheni Zeng and Yufei Huang and Chaojun Xiao and Chi Han and Yi Ren Fung and Yusheng Su and Huadong Wang and Cheng Qian and Runchu Tian and Kunlun Zhu and Shihao Liang and Xingyu Shen and Bokai Xu and Zhen Zhang and Yining Ye and Bowen Li and Ziwei Tang and Jing Yi and Yuzhang Zhu and Zhenning Dai and Lan Yan and Xin Cong and Yaxi Lu and Weilin Zhao and Yuxiang Huang and Junxi Yan and Xu Han and Xian Sun and Dahai Li and Jason Phang and Cheng Yang and Tongshuang Wu and Heng Ji and Zhiyuan Liu and Maosong Sun},
year={2023},
eprint={2304.08354},
archivePrefix={arXiv},
primaryClass={cs.CL}
}