intel / intel-extension-for-transformers
- Saturday, November 25, 2023 at 00:00:13
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
pip install intel-extension-for-transformers
For more installation methods, please refer to the Installation page.
Intel® Extension for Transformers is a toolkit for accelerating Transformer-based models on Intel platforms. It is particularly effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids). The toolkit provides the following key features and examples:
Seamless user experience for model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor
Advanced software optimizations and a unique compression-aware runtime (released with the NeurIPS 2022 papers Fast DistilBERT on CPUs and QuaLA-MiniLM: a Quantized Length Adaptive MiniLM, and the NeurIPS 2021 paper Prune Once for All: Sparse Pre-Trained Language Models)
Optimized Transformer-based model packages such as Stable Diffusion, GPT-J-6B, GPT-NEOX, BLOOM-176B, T5, Flan-T5, and end-to-end workflows such as SetFit-based text classification and document level sentiment analysis (DLSA)
NeuralChat, a customizable chatbot framework for creating your own chatbot within minutes by leveraging a rich set of plugins: Knowledge Retrieval, Speech Interaction, Query Caching, and Security Guardrail.
Inference of Large Language Models (LLMs) in pure C/C++ with weight-only quantization kernels, supporting GPT-NEOX, LLAMA, MPT, FALCON, BLOOM-7B, OPT, ChatGLM2-6B, GPT-J-6B, and Dolly-v2-3B. Supports the AMX, VNNI, AVX512F, and AVX2 instruction sets.
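Because the runtime targets specific instruction sets, it can be useful to check which of them a given machine reports before running inference. The sketch below is a Linux-only illustration and is not part of the toolkit: the helper name `detect_isa` is hypothetical, and it assumes the kernel's `/proc/cpuinfo` flag names (`avx2`, `avx512f`, `avx512_vnni` or `avx_vnni`, `amx_tile`). It says nothing about how the library itself selects kernels.

```python
from pathlib import Path

# Assumed mapping from the instruction sets named above to Linux
# /proc/cpuinfo flag names. Any one matching flag counts as support.
ISA_FLAGS = {
    "AVX2": ("avx2",),
    "AVX512F": ("avx512f",),
    "VNNI": ("avx512_vnni", "avx_vnni"),
    "AMX": ("amx_tile",),
}


def detect_isa(cpuinfo_text):
    """Return {instruction set: bool} from the text of /proc/cpuinfo."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # x86 entries look like "flags\t\t: fpu vme ... avx2 ..."
        if line.startswith("flags") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    return {
        name: any(flag in flags for flag in candidates)
        for name, candidates in ISA_FLAGS.items()
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    # On systems without /proc/cpuinfo every entry is reported as False.
    text = cpuinfo.read_text() if cpuinfo.exists() else ""
    print(detect_isa(text))
```

On other operating systems the same information has to come from a different source, so treat an all-`False` result there as "unknown" rather than "unsupported".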
Below is the sample code to enable the chatbot. See more examples.
# pip install intel-extension-for-transformers
from intel_extension_for_transformers.neural_chat import build_chatbot
chatbot = build_chatbot()
response = chatbot.predict("Tell me about Intel Xeon Scalable Processors.")
print(response)
Below is the sample code to enable weight-only INT4/INT8 inference. See more examples.
# INT4 inference
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
# INT8 inference
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v1-1" # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
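To make "weight-only" quantization more concrete, here is a minimal pure-Python sketch of one common scheme: symmetric, group-wise, round-to-nearest quantization of weights to 4-bit integers with one floating-point scale per group. It is an illustration under stated assumptions, not the toolkit's kernels: the function names, the group size of 32, and the choice of a symmetric scheme are all assumptions made for this example.

```python
def quantize_group(values, bits=4):
    """Symmetric round-to-nearest quantization of one group of floats."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed integers
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the signed range [-qmax - 1, qmax] after rounding.
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_weights(weights, group_size=32, bits=4):
    """Split a flat weight list into groups, each with its own scale."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def dequantize_weights(groups):
    """Rebuild approximate float weights from integer codes and scales."""
    return [code * scale for codes, scale in groups for code in codes]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0, 0.03, -0.02, 0.01, 0.04]
    groups = quantize_weights(weights, group_size=4)
    restored = dequantize_weights(groups)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("codes and scales:", groups)
    print("largest reconstruction error:", worst)
```

Only the weights are stored in low precision here, and they are converted back to floating point before use. Per-group scales keep the rounding error of each weight within half a quantization step of its own group, which is why small-magnitude groups are not swamped by a few large weights elsewhere.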
The latest INT4 performance and accuracy data are available in the INT4 blog.
Additionally, we are preparing to introduce Baichuan, Mistral, and other models into LLM Runtime (Intel Optimized llama.cpp). For comprehensive accuracy and performance data, which may not be the most up-to-date, please refer to the Release data.
Section | Topics
---|---
Overview | NeuralChat, LLM Runtime
NeuralChat | Chatbot on Intel CPU, Chatbot on Intel GPU, Chatbot on Gaudi, Chatbot on Client, More Notebooks
LLM Runtime | LLM Runtime, Streaming LLM, Low Precision Kernels, Tensor Parallelism
LLM Compression | SmoothQuant (INT8), Weight-only Quantization (INT4/FP4/NF4/INT8), QLoRA on CPU
General Compression | Quantization, Pruning, Distillation, Orchestration, Neural Architecture Search, Export, Metrics, Objectives, Pipeline, Length Adaptive, Early Exit, Data Augmentation
Tutorials & Results | Tutorials, LLM List, General Model List, Model Performance
View Full Publication List.
Excellent open-source projects: bitsandbytes, FastChat, fastRAG, ggml, gptq, llama.cpp, lm-evaluation-harness, peft, trl, streamingllm and many others.
Thanks to all the contributors.
We welcome any interesting ideas on model compression techniques and LLM-based chatbot development! Feel free to reach out to us; we look forward to collaborating on Intel Extension for Transformers!