liltom-eth / llama2-webui
- Wednesday, July 26, 2023, 00:00:02
Run Llama 2 locally with a gradio web UI on GPU or CPU from anywhere (Linux/Windows/Mac). Supports Llama 2 7B, 13B, and 70B in 8-bit and 4-bit modes. Supports GPU inference with at least 6 GB VRAM and CPU inference with at least 6 GB RAM.
Supported models: Llama-2-7b/13b/70b, all Llama-2-GPTQ, all Llama-2-GGML ...
Supported model backends:
- Nvidia GPU: transformers, bitsandbytes (8-bit inference), AutoGPTQ (4-bit inference)
- CPU, Mac/AMD GPU: llama.cpp
- Web UI: gradio
pip install -r requirements.txt
bitsandbytes >= 0.39 may not work on older Nvidia GPUs. In that case, to use LOAD_IN_8BIT, you may have to downgrade like this:
pip install bitsandbytes==0.38.1
If running on CPU, additionally install the llama.cpp bindings with pip install llama-cpp-python.
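After installing, a quick sanity check can confirm the GPU backend is visible (a minimal sketch; assumes torch and bitsandbytes from requirements.txt are installed):

```python
# Install sanity check (assumes the pip installs above succeeded).
from importlib.metadata import version

import torch

print("torch:", version("torch"), "| CUDA available:", torch.cuda.is_available())
print("bitsandbytes:", version("bitsandbytes"))  # >= 0.39, or 0.38.1 on older GPUs
```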
Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters.
Llama-2-7b-Chat-GPTQ contains the GPTQ model files for Meta's Llama 2 7B Chat. The 4-bit GPTQ Llama 2 models require less GPU VRAM to run.
| Model Name | MODEL_PATH in .env | Download URL |
|---|---|---|
| meta-llama/Llama-2-7b-chat-hf | /path-to/Llama-2-7b-chat-hf | Link |
| meta-llama/Llama-2-13b-chat-hf | /path-to/Llama-2-13b-chat-hf | Link |
| meta-llama/Llama-2-70b-chat-hf | /path-to/Llama-2-70b-chat-hf | Link |
| meta-llama/Llama-2-7b-hf | /path-to/Llama-2-7b-hf | Link |
| meta-llama/Llama-2-13b-hf | /path-to/Llama-2-13b-hf | Link |
| meta-llama/Llama-2-70b-hf | /path-to/Llama-2-70b-hf | Link |
| TheBloke/Llama-2-7b-Chat-GPTQ | /path-to/Llama-2-7b-Chat-GPTQ | Link |
| TheBloke/Llama-2-7B-Chat-GGML | /path-to/llama-2-7b-chat.ggmlv3.q4_0.bin | Link |
| ... | ... | ... |
Running the 4-bit model Llama-2-7b-Chat-GPTQ needs a GPU with 6 GB VRAM.
Running the 4-bit model llama-2-7b-chat.ggmlv3.q4_0.bin needs a CPU with 6 GB RAM (at 4 bits, the 7B weights alone take about 7B × 0.5 bytes ≈ 3.5 GB; context and runtime overhead account for the rest). There is also a list of other 2-, 3-, 4-, 5-, 6-, and 8-bit GGML models that can be used from TheBloke/Llama-2-7B-Chat-GGML.
These models can be downloaded from their links on the command line, for example:
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone git@hf.co:meta-llama/Llama-2-7b-chat-hf
To download Llama 2 models, you need to request access from https://ai.meta.com/llama/ and also enable access on repos like meta-llama/Llama-2-7b-chat-hf. Requests are typically processed within hours.
For GPTQ models like TheBloke/Llama-2-7b-Chat-GPTQ, you can directly download without requesting access.
For GGML models like TheBloke/Llama-2-7B-Chat-GGML, you can directly download without requesting access.
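As an alternative to git, the huggingface_hub library can fetch a repo directly; a minimal sketch (assumes pip install huggingface_hub, and for gated meta-llama repos an approved account logged in via huggingface-cli login):

```python
from huggingface_hub import snapshot_download

# Download every file in the repo into local_dir.
# GPTQ/GGML repos from TheBloke need no access request.
snapshot_download(
    repo_id="TheBloke/Llama-2-7b-Chat-GPTQ",
    local_dir="./models/Llama-2-7b-Chat-GPTQ",
)
```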
Set up your MODEL_PATH and model configs in the .env file.
There are some examples in the ./env_examples/ folder.
| Model Setup | Example .env |
|---|---|
| Llama-2-7b-chat-hf 8-bit on GPU | .env.7b_8bit_example |
| Llama-2-7b-Chat-GPTQ 4-bit on GPU | .env.7b_gptq_example |
| Llama-2-7B-Chat-GGML 4-bit on CPU | .env.7b_ggmlv3_q4_0_example |
| Llama-2-13b-chat-hf on GPU | .env.13b_example |
| ... | ... |
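For reference, a .env for 8-bit GPU inference might look roughly like this (variable names are the ones used in this README; see ./env_examples/ for the authoritative examples):

```
MODEL_PATH=/path-to/Llama-2-7b-chat-hf
LOAD_IN_8BIT=True
LOAD_IN_4BIT=False
```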
Run chatbot with web UI:
python app.py
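app.py wires the loaded model into a gradio interface; a stripped-down sketch of the same idea (not the repo's actual code; the echo function stands in for real model inference):

```python
import gradio as gr

def generate(prompt: str) -> str:
    # Placeholder: in app.py this would call the loaded Llama 2 backend.
    return "echo: " + prompt

# Serves a local web UI, by default at http://127.0.0.1:7860.
gr.Interface(fn=generate, inputs="text", outputs="text").launch()
```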
Running Llama-2-7b requires around 14 GB of GPU VRAM; Llama-2-13b requires around 28 GB.
If you are running on multiple GPUs, the model is loaded across them automatically, splitting the VRAM usage. That allows you to run Llama-2-7b (which needs 14 GB of GPU VRAM) on a setup like 2 GPUs with 11 GB VRAM each.
If you do not have enough memory, you can set LOAD_IN_8BIT to True in .env. This reduces memory usage by around half, with slightly degraded model quality. It is compatible with the CPU, GPU, and Metal backends.
Llama-2-7b with 8-bit compression can run on a single GPU with 8 GB of VRAM, such as an Nvidia RTX 2080 Ti, RTX 4080, T4, or V100 (16 GB).
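For context, 8-bit loading corresponds roughly to the standard transformers call below (a sketch, not the repo's exact code; the model path is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path-to/Llama-2-7b-chat-hf"  # your MODEL_PATH
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",   # also handles the multi-GPU split described above
    load_in_8bit=True,   # routes weights through bitsandbytes, roughly halving VRAM
)
```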
If you want to run a 4-bit Llama 2 model like Llama-2-7b-Chat-GPTQ, you can set LOAD_IN_4BIT to True in .env, as in the example .env.7b_gptq_example.
Make sure you have downloaded the 4-bit model from Llama-2-7b-Chat-GPTQ and set the MODEL_PATH and arguments in the .env file.
Llama-2-7b-Chat-GPTQ can run on a single GPU with 6 GB of VRAM.
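For context, 4-bit GPTQ loading goes through AutoGPTQ; a minimal sketch of that API (assumed to mirror what the GPTQ backend does; parameters are illustrative):

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_path = "/path-to/Llama-2-7b-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    device="cuda:0",
    use_safetensors=True,  # TheBloke's GPTQ repos ship safetensors weights
)
```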
Running a Llama 2 model on CPU requires llama.cpp and its Python bindings:
pip install llama-cpp-python
Download a GGML model like llama-2-7b-chat.ggmlv3.q4_0.bin following the Download Llama-2 Models section. The llama-2-7b-chat.ggmlv3.q4_0.bin model requires at least 6 GB RAM to run on CPU.
Set up configs by copying an example like .env.7b_ggmlv3_q4_0_example from env_examples to .env.
Run the web UI: python app.py
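Under the hood, the CPU path uses the llama-cpp-python bindings; a minimal standalone sketch of that API (the prompt and parameters are illustrative):

```python
from llama_cpp import Llama

# Load the 4-bit GGML model; needs at least ~6 GB of RAM, as noted above.
llm = Llama(model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin", n_ctx=2048)
out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["Q:"])
print(out["choices"][0]["text"])
```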
If you would like to use a Mac GPU or an AMD/Nvidia GPU for acceleration, check the llama-cpp-python installation notes for building with the corresponding backend.
Kindly read our Contributing Guide to learn about our development process.
MIT - see MIT License