MDK8888 / GPTFast
- Wednesday, February 28, 2024, at 00:00:10
Accelerate your Hugging Face Transformers 6-7x with GPTFast! Native to Hugging Face and PyTorch.
GPTFast was originally a set of techniques developed by the PyTorch Team to accelerate the inference speed of Llama-2-7b. This pip package generalizes those techniques to all Hugging Face models.
*(The original README shows a side-by-side comparison of GPTFast inference time vs. eager inference time; the images are not reproduced here.)*
```bash
python3 -m venv VENV_NAME
source VENV_NAME/bin/activate  # .\VENV_NAME\Scripts\activate if you are on Windows
pip install gptfast
```
```python
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from GPTFast.Core import gpt_fast
from GPTFast.Helpers import timed

torch._dynamo.reset()
os.environ["TOKENIZERS_PARALLELISM"] = "false"

device = "cuda" if torch.cuda.is_available() else "cpu"

def argmax(self, probabilities):
    # Use argmax to get the token with the maximum probability
    max_prob_index = torch.argmax(probabilities, dim=-1)
    return max_prob_index.unsqueeze(0)

model_name = "gpt2-xl"
draft_model_name = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_name)
initial_string = "Write me a short story."
input_tokens = tokenizer.encode(initial_string, return_tensors="pt").to(device)

N_ITERS = 10
MAX_TOKENS = 50

gpt_fast_model = gpt_fast(model_name, draft_model_name=draft_model_name, sample_function=argmax)
gpt_fast_model.to(device)

fast_compile_times = []
for i in range(N_ITERS):
    with torch.no_grad():
        res, compile_time = timed(lambda: gpt_fast_model.generate(cur_tokens=input_tokens, max_tokens=MAX_TOKENS, speculate_k=6))
    fast_compile_times.append(compile_time)
    print(f"gpt fast eval time {i}: {compile_time}")
print("~" * 10)
```
At its core, this library provides a simple interface to LLM inference acceleration techniques. All of the following functions can be imported from `GPTFast.Core`:
`gpt_fast(model_name: str, draft_model_name: str, sample_function: Callable) -> torch.nn.Module`

- `model_name`: the name of the Hugging Face model that you want to optimize.
- `draft_model_name`: the name of the Hugging Face draft model, which is needed for speculative decoding. Note that the model and the draft model must both use the same tokenizer, and the draft model must be significantly smaller to achieve inference acceleration.
- `sample_function(distribution, **kwargs)`: a function used to sample from the distribution generated by the main model. It has one mandatory parameter, a tensor of dimension `(seq_len, vocab_size)`, and returns a tensor of shape `(1, 1)`.
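As an illustration, here is a hypothetical top-k sampling function that follows this contract. The name `top_k_sample` and the `k` parameter are made up for this sketch; `self` mirrors the `argmax` example above, and `k` would be supplied through `**sampling_kwargs` at generation time:

```python
import torch

# Hypothetical top-k sampler following the documented contract:
# `probabilities` has shape (seq_len, vocab_size); a (1, 1) token id is returned.
def top_k_sample(self, probabilities: torch.Tensor, k: int = 5) -> torch.Tensor:
    last_dist = probabilities[-1]                  # distribution for the next token
    top_probs, top_ids = torch.topk(last_dist, k)  # keep the k most likely tokens
    # Renormalize over the top-k candidates and sample one of them.
    choice = torch.multinomial(top_probs / top_probs.sum(), num_samples=1)
    return top_ids[choice].view(1, 1)
```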
The module returned by `gpt_fast` has a `generate` method:

`generate(self, cur_tokens: torch.Tensor, max_tokens: int, speculate_k: int, **sampling_kwargs) -> torch.Tensor`
- `cur_tokens`: a PyTorch tensor of size `(1, seq_len)`.
- `max_tokens`: an int representing how many tokens you want to generate.
- `speculate_k`: an int specifying how far you want the draft model to speculate in speculative decoding.
- `**sampling_kwargs`: additional parameters that are necessary for sampling from the distribution. These should match the `**kwargs` of the `sample_function` above.

`generate` returns a tensor of shape `(1, max_tokens)`.
`load_int8(model_name: str) -> torch.nn.Module`

- `model_name`: a string specifying the model that you are using.
Returns an `int8`-quantized version of your model.

`add_kv_cache(model_name: str) -> KVCacheModel`

- `model_name`: a string specifying the model that you are using.

Returns a `KVCacheModel`, which is essentially just your model with a key-value cache attached for accelerated inference.

`add_speculative_decode_kv_cache(model: KVCacheModel, draft_model: KVCacheModel, sample_function: Callable) -> torch.nn.Module`
- `model`: the KV-cached version of your model.
- `draft_model`: the KV-cached version of your draft model.
- `sample_function(distribution, **kwargs)`: same as the documentation for `gpt_fast` above.

Returns a torch.nn.Module with the `generate` method described above under the `gpt_fast` section.
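These building blocks can also be combined by hand instead of going through `gpt_fast`. A minimal sketch based only on the signatures documented above; note that it skips the `int8` quantization step and any compilation that `gpt_fast` may apply:

```python
from GPTFast.Core import add_kv_cache, add_speculative_decode_kv_cache

# Sketch only: attach KV caches to the main and draft models, then wire up
# speculative decoding with the same sampling function as before.
cached_model = add_kv_cache("gpt2-xl")
cached_draft_model = add_kv_cache("gpt2")
spec_decode_model = add_speculative_decode_kv_cache(cached_model, cached_draft_model, sample_function=argmax)

# The result exposes the same `generate` method documented above.
output = spec_decode_model.generate(cur_tokens=input_tokens, max_tokens=50, speculate_k=6)
```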