EricLBuehler / mistral.rs

воскресенье, 28 апреля 2024 г. в 00:00:02

https://github.com/EricLBuehler/mistral.rs

Blazingly fast LLM inference.

mistral.rs

Blazingly fast LLM inference.

| Rust Documentation | Python Documentation | Discord |

Mistral.rs is a fast LLM inference platform supporting inference on a variety of devices, quantization, and easy-to-use application with an Open-AI API compatible HTTP server and Python bindings.

Upcoming features

More models: please submit requests here.
X-LoRA: Scalings topk and softmax topk (#48).
Parallel linear layers (sharding) (#50).
Speculative decoding: https://arxiv.org/pdf/2211.17192

Running the new Llama 3 model

cargo run --release --features ... -- -i plain -m meta-llama/Meta-Llama-3-8B-Instruct -a llama

Running the new Phi 3 model with 128K context window

cargo run --release --features ... -- -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3

Description

Fast:

Quantized model support: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit and 8-bit for faster inference and optimized memory usage.
Continuous batching.
Prefix caching.
Device mapping: load and run some layers on the device and the reset on the CPU.

Accelerator support:

Apple silicon support with the Metal framework.
CPU inference with mkl, accelerate support and optimized backend.
CUDA support with flash attention and cuDNN.

Easy:

Lightweight OpenAI API compatible HTTP server.
Python API.
Grammar support with Regex and Yacc.
ISQ (In situ quantization): run .safetensors models directly from Huggingface Hub by quantizing them after loading instead of creating a GGUF file.

Powerful:

Fast LoRA support with weight merging.
First X-LoRA inference platform with first class support.

This is a demo of interactive mode with streaming running Mistral GGUF:

demo_new.mp4

Supported models:

Mistral 7B (v0.1 and v0.2)
Gemma
Llama, including Llama 3
Mixtral 8x7B
Phi 2
Phi 3
Qwen 2

Please see this section for details on quantization and LoRA support.

APIs and Integrations

Rust Library API

Rust multithreaded API for easy integration into any application.

Docs
Examples
To install: Add mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }

Python API

Python API for mistral.rs.

from mistralrs import Runner, Which, ChatCompletionRequest, Message, Role

runner = Runner(
    which=Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
        tokenizer_json=None,
        repeat_last_n=64,
    )
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="mistral",
        messages=[Message(Role.User, "Tell me a story about the Rust type system.")],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)

HTTP Server

OpenAI API compatible API server

Llama Index integration

Supported accelerators

CUDA:
- Enable with cuda feature: --features cuda
- Flash attention support with flash-attn feature, only applicable to non-quantized models: --features flash-attn
- cuDNNsupport with cudnn feature: --features cudnn
Metal:
- Enable with metal feature: --features metal
CPU:
- Intel MKL with mkl feature: --features mkl
- Apple Accelerate with accelerate feature: --features accelerate

Enabling features is done by passing --features ... to the build system. When using cargo run or maturin develop, pass the --features flag before the -- separating build flags from runtime flags.

To enable a single feature like metal: cargo build --release --features metal.
To enable multiple features, specify them in quotes: cargo build --release --features "cuda flash-attn cudnn".

Benchmarks

Device	Mistral.rs Completion T/s	Llama.cpp Completion T/s	Model	Quant
A10 GPU, CUDA	78	78	mistral-7b	4_K_M
Intel Xeon 8358 CPU, AVX	6	19	mistral-7b	4_K_M
Raspberry Pi 5 (8GB), Neon	2	segfault	mistral-7b	2_K
A100 GPU, CUDA	110	119	mistral-7b	4_K_M

Please submit more benchmarks via raising an issue!

Usage

Installation and Build

To install mistral.rs, one should ensure they have Rust installed by following this link. Additionally, the Huggingface token should be provided in ~/.cache/huggingface/token when using the server to enable automatic download of gated models.

Install required packages
- openssl (ex., sudo apt install libssl-dev)
- pkg-config (ex., sudo apt install pkg-config)

Install Rust: https://rustup.rs/

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

Set HF token correctly (skip if already set or your model is not gated, or if you want to use the token_source parameters in Python or the command line.)
```
mkdir ~/.cache/huggingface
touch ~/.cache/huggingface/token
echo <HF_TOKEN_HERE> > ~/.cache/huggingface/token
```

Download the code

git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs

Build or install
- Base build command
```
cargo build --release
```
- Build with CUDA support
```
cargo build --release --features cuda
```
- Build with CUDA and Flash Attention V2 support
```
cargo build --release --features "cuda flash-attn"
```
- Build with Metal support
```
cargo build --release --features metal
```
- Build with Accelerate support
```
cargo build --release --features accelerate
```
- Build with MKL support
```
cargo build --release --features mkl
```
- Install with cargo install for easy command line usage
  
  Pass the same values to --features as you would for cargo build
```
cargo install --path mistralrs-server --features cuda
```
The build process will output a binary misralrs-server at ./target/release/mistralrs-server which may be copied into the working directory with the following command:
```
cp ./target/release/mistralrs-server ./mistralrs_server
```
Installing Python support

You can install Python support by following the guide here.

Getting models

Loading from HF Hub:

Mistral.rs can automatically download models from HF Hub. To access gated models, you should provide a token source. They may be one of:

literal:<value>: Load from a specified literal
env:<value>: Load from a specified environment variable
path:<value>: Load from a specified file
cache: default: Load from the HF token at ~/.cache/huggingface/token or equivalent.
none: Use no HF token

This is passed in the following ways:

Command line:

./mistralrs_server --token-source none -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3

Python:

Here is an example of setting the token source.

If token cannot be loaded, no token will be used (i.e. effectively using none).

Loading from local files:

You can also instruct mistral.rs to load models locally by modifying the *_model_id arguments or options:

./mistralrs_server --port 1234 plain -m . -a mistral

The following files must be present in the paths for the options below:

--model-id (server) or model_id (python) or --tok-model-id (server) or tok_model_id (python):
- config.json
- tokenizer_config.json
- tokenizer.json (if not specified separately)
- .safetensors files.
--quantized-model-id (server) or quantized_model_id (python):
- Specified .gguf or .ggml file.
--x-lora-model-id (server) or xlora_model_id (python):
- xlora_classifier.safetensors
- xlora_config.json
- Adapters .safetensors and adapter_config.json files in their respective directories

Run

To start a server serving Mistral GGUF on localhost:1234,

./mistralrs_server --port 1234 --log output.log gguf -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -t mistralai/Mistral-7B-Instruct-v0.1 -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

Mistral.rs uses subcommands to control the model type. They are generally of format <XLORA/LORA>-<QUANTIZATION>. Please run ./mistralrs_server --help to see the subcommands.

Additionally, for models without quantization, the model architecture should be provided as the --arch or -a argument in contrast to GGUF models which encode the architecture in the file. It should be one of the following:

mistral
gemma
mixtral
llama
phi2
phi3
qwen2

Interactive mode:

You can launch interactive mode, a simple chat application running in the terminal, by passing -i:

./mistralrs_server -i gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

Quick examples:

X-LoRA with no quantization

To start an X-LoRA server with the exactly as presented in the paper:

./mistralrs_server --port 1234 x-lora-plain -o orderings/xlora-paper-ordering.json -x lamm-mit/x-lora

LoRA with a model from GGUF

To start an LoRA server with adapters from the X-LoRA paper (you should modify the ordering file to use only one adapter, as the adapter static scalings are all 1 and so the signal will become distorted):

./mistralrs_server --port 1234 lora-gguf -o orderings/xlora-paper-ordering.json -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q8_0.gguf -x lamm-mit/x-lora

Normally with a LoRA model you would use a custom ordering file. However, for this example we use the ordering from the X-LoRA paper because we are using the adapters from the X-LoRA paper.

With a model from GGUF

To start a server running Mistral from GGUF:

./mistralrs_server --port 1234 gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

With a model from GGML

To start a server running Llama from GGML:

./mistralrs_server --port 1234 ggml -t meta-llama/Llama-2-13b-chat-hf -m TheBloke/Llama-2-13B-chat-GGML -f llama-2-13b-chat.ggmlv3.q4_K_M.bin

Plain model from safetensors

To start a server running Mistral from safetensors.

./mistralrs_server --port 1234 gguf -m mistralai/Mistral-7B-Instruct-v0.1

Command line docs

Command line docs here

Supported models

Quantization support

Model	GGUF	GGML
Mistral 7B	✅
Gemma
Llama	✅	✅
Mixtral 8x7B	✅
Phi 2	✅
Phi 3	✅
Qwen 2

Device mapping support

Model	Supported
Normal	✅
GGUF	✅
GGML

X-LoRA and LoRA support

Model	X-LoRA	X-LoRA+GGUF	X-LoRA+GGML
Mistral 7B	✅	✅
Gemma	✅
Llama	✅	✅	✅
Mixtral 8x7B	✅	✅
Phi 2	✅
Phi 3	✅	✅
Qwen 2

Using derivative models

To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass --help after the subcommand. For example, when using a different model than the default, specify the following for the following types of models:

Normal: Model id
Quantized: Quantized model id, quantized filename, and tokenizer id
X-LoRA: Model id, X-LoRA ordering
X-LoRA quantized: Quantized model id, quantized filename, tokenizer id, and X-LoRA ordering
LoRA: Model id, LoRA ordering
LoRA quantized: Quantized model id, quantized filename, tokenizer id, and LoRA ordering

See this section to determine if it is necessary to prepare an X-LoRA/LoRA ordering file, it is always necessary if the target modules or architecture changed, or if the adapter order changed.

It is also important to check the chat template style of the model. If the HF hub repo has a tokenizer_config.json file, it is not necessary to specify. Otherwise, templates can be found in chat_templates and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, no messages.

For example, when using a Zephyr model:

./mistralrs_server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf

Adapter model support: X-LoRA and LoRA

An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the x-lora-* architecture, and LoRA support by selecting the lora-* architecture. Please find docs for adapter models here

Chat Templates and Tokenizer

Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation here.

Contributing

If you have any problems or want to contribute something, please raise an issue or pull request!

Consider enabling RUST_LOG=debug environment variable.

If you want to add a new model, please see our guide.

Credits

This project would not be possible without the excellent work at candle. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.