vllm-project / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
| Documentation | Blog | Discussions |
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Optimized CUDA kernels
vLLM is flexible and easy to use with:
- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
vLLM seamlessly supports many Huggingface models, including the following architectures:
- GPT-2 (gpt2, gpt2-xl, etc.)
- GPT BigCode (bigcode/starcoder, bigcode/gpt_bigcode-santacoder, etc.)
- GPT-NeoX (EleutherAI/gpt-neox-20b, databricks/dolly-v2-12b, stabilityai/stablelm-tuned-alpha-7b, etc.)
- LLaMA (lmsys/vicuna-13b-v1.3, young-geng/koala, openlm-research/open_llama_13b, etc.)
- OPT (facebook/opt-66b, facebook/opt-iml-max-30b, etc.)
Install vLLM with pip or from source:
pip install vllm
Visit our documentation to get started.
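As a quick illustration, here is a minimal offline-inference sketch using the LLM and SamplingParams classes; facebook/opt-125m is just an example checkpoint, and the prompts and sampling settings are arbitrary.

from vllm import LLM, SamplingParams

# Example prompts and sampling settings (the temperature/top_p values are arbitrary).
prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load a supported HuggingFace checkpoint, e.g. a small OPT model.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in a single batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)

For multi-GPU inference, the LLM constructor also accepts a tensor_parallel_size argument.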
vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x in terms of throughput. For details, check out our blog post.
Figure: Serving throughput when each request asks for 1 output completion.
Figure: Serving throughput when each request asks for 3 output completions.
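For online serving, a rough sketch of the OpenAI-compatible workflow is shown below; it assumes the vllm.entrypoints.openai.api_server entrypoint and the default port 8000, and the model name and prompt are placeholders.

# First start the server in a separate shell (the model name is an example):
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
import requests

# Query the OpenAI-compatible completions endpoint (port 8000 assumed).
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(response.json()["choices"][0]["text"])

The response follows the OpenAI completions format, so existing OpenAI API clients can be pointed at the server with only a base-URL change.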
We welcome and value any contributions and collaborations. Please check out CONTRIBUTING.md for how to get involved.