lvwerra / trl
Train transformer language models with reinforcement learning.
With trl you can train transformer language models with Proximal Policy Optimization (PPO). The library is built on top of the transformers library by Hugging Face. At this point, most decoder and encoder-decoder architectures are supported.
Highlights:
- PPOTrainer: A PPO trainer for language models that just needs (query, response, reward) triplets to optimise the language model.
- AutoModelForCausalLMWithValueHead & AutoModelForSeq2SeqLMWithValueHead: Transformer models with an additional scalar output for each token, which can be used as a value function in reinforcement learning.

Fine-tuning a language model via PPO consists of roughly three steps:
1. Rollout: the language model generates a response from a query.
2. Evaluation: the query and response are scored with a function, another model, human feedback, or some combination of these, producing a single scalar reward per query/response pair.
3. Optimization: the query/response pairs and their rewards are used in a PPO step to update the language model, while a KL term against a reference model keeps the generations from drifting too far.
This process is illustrated in the sketch below.
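In code, one iteration of this loop might look roughly like the following sketch; dataloader and reward_fn are hypothetical placeholders, and the quickstart further below shows the full setup:

# illustrative sketch of one PPO fine-tuning iteration (not a runnable script on its own);
# `dataloader` yields query token tensors and `reward_fn` stands in for any reward source
for query_tensor in dataloader:
    # 1. rollout: generate a response for the query
    response_tensor = respond_to_batch(model, query_tensor)
    # 2. evaluation: score the (query, response) pair
    reward = [reward_fn(query_tensor, response_tensor)]
    # 3. optimization: run a single PPO step on the triplet
    stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)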
Install the library with pip:
pip install trl
If you want to run the examples in the repository, a few additional libraries are required. Clone the repository and install it with pip:
git clone https://github.com/lvwerra/trl.git
cd trl/
pip install .
If you wish to develop TRL, you should install in editable mode:
pip install -e .
This is a basic example of how to use the library. Based on a query, the language model creates a response, which is then evaluated. The evaluation could come from a human in the loop or from another model's output.
# imports
import torch
from transformers import AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead, create_reference_model
from trl.core import respond_to_batch
# get models
model = AutoModelForCausalLMWithValueHead.from_pretrained('gpt2')
model_ref = create_reference_model(model)
tokenizer = AutoTokenizer.from_pretrained('gpt2')
# initialize trainer
ppo_config = PPOConfig(
    batch_size=1,
)
# encode a query
query_txt = "This morning I went to the "
query_tensor = tokenizer.encode(query_txt, return_tensors="pt")
# get model response
response_tensor = respond_to_batch(model, query_tensor)
# create a ppo trainer
ppo_trainer = PPOTrainer(ppo_config, model, model_ref, tokenizer)
# define a reward for response
# (this could be any reward such as human feedback or output from another model)
reward = [torch.tensor(1.0)]
# train model for one step with ppo
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)
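After the step, the generated tokens can be decoded back to text with the tokenizer, and the constant reward above can be replaced by the output of another model, for example a sentiment classifier. The snippet below is a small sketch of that idea; the lvwerra/distilbert-imdb checkpoint and the reward_from_sentiment helper are illustrative choices, not part of the quickstart above.

# decode the generated response tokens back to text
response_txt = tokenizer.decode(response_tensor[0])
print(query_txt + response_txt)

# sketch: derive the reward from a sentiment classifier instead of a constant
# (model choice and post-processing are illustrative)
from transformers import pipeline
sentiment_pipe = pipeline("sentiment-analysis", model="lvwerra/distilbert-imdb")

def reward_from_sentiment(texts):
    rewards = []
    for output in sentiment_pipe(texts):
        # map the predicted label/score to a scalar in [0, 1] favouring positive text
        score = output["score"] if output["label"] == "POSITIVE" else 1.0 - output["score"]
        rewards.append(torch.tensor(score))
    return rewards

reward = reward_from_sentiment([query_txt + response_txt])
train_stats = ppo_trainer.step([query_tensor[0]], [response_tensor[0]], reward)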
For a detailed example, check out the example Python script examples/sentiment/scripts/gpt2-sentiment.py, where GPT2 is fine-tuned to generate positive movie reviews. A few examples from the language model before and after optimisation are given below:
The PPO implementation largely follows the structure introduced in the paper "Fine-Tuning Language Models from Human Preferences" by D. Ziegler et al. [paper, code].
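Concretely, that setup adds a per-token KL penalty between the trained policy and a frozen reference model to the scalar reward, so the fine-tuned model is discouraged from drifting too far from its starting point. The function below is a conceptual sketch of that reward shaping; the function name and coefficient value are illustrative and not the library's internals.

# conceptual sketch of a Ziegler-et-al.-style KL-penalised reward
# (names and coefficient are illustrative)
import torch

def kl_penalised_rewards(score, logprobs, ref_logprobs, kl_coef=0.2):
    # per-token KL estimate between the policy and the frozen reference model
    kl = logprobs - ref_logprobs
    # every generated token is penalised by the KL term ...
    per_token_rewards = -kl_coef * kl
    # ... and the scalar reward is added on the last generated token
    per_token_rewards[-1] += score
    return per_token_rewards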
The language models utilize the transformers library by Hugging Face. If you use TRL in your work, you can cite it with the following BibTeX entry:
@misc{vonwerra2022trl,
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert},
title = {TRL: Transformer Reinforcement Learning},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/lvwerra/trl}}
}