QwenLM / Qwen3
Qwen3 is the large language model series developed by the Qwen team at Alibaba Cloud.
💜 Qwen Chat | 🤗 Hugging Face | 🤖 ModelScope | 📑 Paper | 📑 Blog | 📖 Documentation
🖥️ Demo | 💬 WeChat (微信) | 🫨 Discord
Visit our Hugging Face or ModelScope organization (click the links above), search for checkpoints whose names start with Qwen3- or visit the Qwen3 collection, and you will find all you need! Enjoy!
To learn more about Qwen3, feel free to read our documentation [EN|ZH].
We are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. These models represent our most advanced and intelligent systems to date, building on our experience from QwQ and Qwen2.5. We are making the weights of Qwen3 available to the public, including both dense and Mixture-of-Experts (MoE) models.
Highlights of Qwen3 include switching between thinking and non-thinking modes within a single model, improved reasoning, and openly released weights for both dense and MoE models.
Important
Qwen3 models adopt a different naming scheme.
The post-trained models do not use the "-Instruct" suffix anymore. For example, Qwen3-32B is the newer version of Qwen2.5-32B-Instruct.
The base models now have names ending with "-Base".
Detailed evaluation results are reported in this 📑 blog.
For requirements on GPU memory and the respective throughput, see results here.
Transformers is a library of pretrained models for natural language processing, supporting both inference and training.
The latest version of transformers is recommended, and transformers>=4.51.0 is required.
The following code snippet illustrates how to use the model to generate content based on given inputs.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# the result will begin with thinking content in <think></think> tags, followed by the actual response
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
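Since the thinking content is wrapped in <think></think> tags, it can be separated from the final response after decoding. The following is a minimal sketch that reuses output_ids and tokenizer from the snippet above; splitting on the literal "</think>" string is just one simple way to do it:
```python
# a minimal sketch: split the thinking content from the final response
# reuses `output_ids` and `tokenizer` from the snippet above
full_text = tokenizer.decode(output_ids, skip_special_tokens=True)

if "</think>" in full_text:
    thinking_content, _, content = full_text.partition("</think>")
    thinking_content = thinking_content.replace("<think>", "").strip()
    content = content.strip()
else:  # e.g., when thinking is disabled, there may be no thinking block
    thinking_content, content = "", full_text.strip()

print("thinking content:", thinking_content)
print("content:", content)
```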
By default, Qwen3 models think before responding. This behavior can be controlled in two ways:
- enable_thinking=False: passing enable_thinking=False to tokenizer.apply_chat_template strictly prevents the model from generating thinking content.
- /think and /no_think instructions: add these words to a system or user message to signal whether Qwen3 should think. In multi-turn conversations, the latest instruction is followed.
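For illustration, the sketch below shows both controls with the tokenizer from the snippet above; the prompts are placeholders.
```python
# hard switch: render the prompt with thinking disabled entirely
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me a short introduction to large language models."}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # the model will not generate <think></think> content
)

# soft switch: keep enable_thinking=True (the default) and steer per turn with /think and /no_think
messages = [
    {"role": "user", "content": "How many r's are in 'strawberry'? /think"},
    {"role": "assistant", "content": "There are three r's in 'strawberry'."},
    {"role": "user", "content": "Now just say hi. /no_think"},  # the latest instruction is followed
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
```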
We strongly advise users, especially those in mainland China, to use ModelScope.
ModelScope adopts a Python API similar to Transformers, and its CLI tool modelscope download can help with issues when downloading checkpoints.
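As a minimal sketch, the ModelScope Python API mirrors the Transformers usage above; this assumes the modelscope package is installed, and the rest of the workflow is unchanged:
```python
# a minimal sketch: load Qwen3 through the ModelScope Python API
# the API mirrors Transformers, so the rest of the workflow is unchanged
from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
# apply_chat_template and generate then work as in the Transformers snippet above
```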
llama.cpp enables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware. llama.cpp>=b5092 is required.
To use the CLI, run the following in a terminal:
```shell
./llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --color -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift
# CTRL+C to exit
```
To use the API server, run the following in a terminal:
```shell
./llama-server -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja --reasoning-format deepseek -ngl 99 -fa -sm row --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -c 40960 -n 32768 --no-context-shift --port 8080
```
A simple web front end will be available at http://localhost:8080, and an OpenAI-compatible API will be available at http://localhost:8080/v1.
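The OpenAI-compatible endpoint can be queried with any OpenAI-style client. Below is a minimal sketch assuming the openai Python package and the server command above; the API key is a placeholder, and the model field is largely informational when the server hosts a single model:
```python
# a minimal sketch: query the llama.cpp OpenAI-compatible API started above
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B-GGUF:Q8_0",  # informational when only one model is loaded
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)
```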
For additional guides, please refer to our documentation.
Tip
llama.cpp adopts "rotating context management", which makes infinite generation possible by evicting earlier tokens. It can be configured via command-line parameters, and the commands above effectively disable it. For more details, please refer to our documentation.
Important
The chat template uses features that are not supported by the template engine used by llama.cpp. As a result, you may encounter the following error if the original chat template is used:
```
common_chat_templates_init: failed to parse chat template (defaulting to chatml)
```
We are working on a proper fix.
After installing Ollama, you can initiate the Ollama service with the following command (Ollama v0.6.6 or higher is required):
```shell
ollama serve
# You need to keep this service running whenever you are using ollama
```
To pull a model checkpoint and run the model, use the ollama run command. You can specify a model size by adding a suffix to qwen3, such as :8b or :30b-a3b:
```shell
ollama run qwen3:8b
# To set parameters, type "/set parameter num_ctx 40960" and "/set parameter num_predict 32768"
# To exit, type "/bye" and press ENTER
```
You can also access the Ollama service via its OpenAI-compatible API. Please note that you need to (1) keep ollama serve running while using the API, and (2) execute ollama run qwen3:8b before using the API to ensure that the model checkpoint is prepared. The API is available at http://localhost:11434/v1/ by default.
For additional details, please visit ollama.ai.
Tip
Ollama adopts the same "rotating context management" as llama.cpp.
However, its default settings (num_ctx 2048 and num_predict -1, which imply infinite generation within a 2048-token context) can cause trouble for Qwen3 models.
We recommend setting num_ctx and num_predict appropriately.
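For example, when using Ollama's native REST API, these options can be passed per request. A minimal sketch with the requests package, assuming ollama serve is running and qwen3:8b has been pulled:
```python
# a minimal sketch: call Ollama's native chat endpoint with explicit context settings
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
        "stream": False,
        "options": {"num_ctx": 40960, "num_predict": 32768},  # mirror the settings suggested above
    },
)
print(response.json()["message"]["content"])
```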
Qwen3 is already supported by lmstudio.ai. You can use LM Studio directly with our GGUF files.
If you are running on Apple Silicon, mlx-lm also supports Qwen3 (mlx-lm>=0.24.0). Look for models ending with MLX on Hugging Face Hub.
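A minimal sketch with the mlx-lm Python API follows; the checkpoint name is illustrative, and the exact generate signature may vary slightly across mlx-lm versions:
```python
# a minimal sketch: run a Qwen3 MLX checkpoint with mlx-lm on Apple Silicon
# the repository name below is illustrative; pick any Qwen3 MLX checkpoint from the Hub
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3-8B-MLX-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Give me a short introduction to large language models."}],
    add_generation_prompt=True,
)
text = generate(model, tokenizer, prompt=prompt, max_tokens=1024, verbose=True)
```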
Qwen3 is supported by multiple inference frameworks. Here we demonstrate the usage of SGLang and vLLM.
You can also find Qwen3 models from various inference providers, e.g., Alibaba Cloud Model Studio.
SGLang is a fast serving framework for large language models and vision language models.
SGLang can be used to launch a server with an OpenAI-compatible API. sglang>=0.4.6.post1 is required.
It is as easy as
```shell
python -m sglang.launch_server --model-path Qwen/Qwen3-8B --port 30000 --reasoning-parser qwen3
```
An OpenAI-compatible API will be available at http://localhost:30000/v1.
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
vllm>=0.8.5 is recommended.
```shell
vllm serve Qwen/Qwen3-8B --port 8000 --enable-reasoning --reasoning-parser deepseek_r1
```
An OpenAI-compatible API will be available at http://localhost:8000/v1.
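With a reasoning parser enabled, the server separates the thinking content from the final answer in its responses. Below is a minimal sketch with the openai client against the vLLM command above; the API key is a placeholder, and reasoning_content is the field vLLM uses for the parsed thinking content:
```python
# a minimal sketch: query the vLLM server started above and read the parsed reasoning
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    temperature=0.6,
    top_p=0.95,
)
message = response.choices[0].message
print("reasoning content:", getattr(message, "reasoning_content", None))  # thinking content parsed by the server
print("content:", message.content)
```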
For deployment on Ascend NPUs, please visit Modelers and search for Qwen3.
For tool use capabilities, we recommend taking a look at Qwen-Agent, which provides a wrapper around these APIs to support tool use or function calling with MCP support. Tool use with Qwen3 can also be conducted with SGLang, vLLM, Transformers, llama.cpp, Ollama, etc. Follow guides in our documentation to see how to enable the support.
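As a small illustration of the Transformers route, Qwen3's chat template accepts tool definitions through the tools argument of apply_chat_template. The weather function below is a made-up schema for illustration; parsing and executing the resulting tool call is up to your application:
```python
# a minimal sketch: render a prompt with a (made-up) tool schema via Transformers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",  # hypothetical tool for illustration
            "description": "Get the current weather in a given city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What's the weather like in Beijing?"}],
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
)
# `text` now contains the tool definitions rendered into the prompt;
# generation and tool-call parsing proceed as in the frameworks above
```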
We advise you to use training frameworks, including Axolotl, UnSloth, Swift, Llama-Factory, etc., to finetune your models with SFT, DPO, GRPO, etc.
All our open-source models are licensed under Apache 2.0. You can find the license files in the respective Hugging Face repositories.
If you find our work helpful, feel free to give us a cite.
@article{qwen2.5,
title = {Qwen2.5 Technical Report},
author = {An Yang and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chengyuan Li and Dayiheng Liu and Fei Huang and Haoran Wei and Huan Lin and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Yang and Jiaxi Yang and Jingren Zhou and Junyang Lin and Kai Dang and Keming Lu and Keqin Bao and Kexin Yang and Le Yu and Mei Li and Mingfeng Xue and Pei Zhang and Qin Zhu and Rui Men and Runji Lin and Tianhao Li and Tingyu Xia and Xingzhang Ren and Xuancheng Ren and Yang Fan and Yang Su and Yichang Zhang and Yu Wan and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zihan Qiu},
journal = {arXiv preprint arXiv:2412.15115},
year = {2024}
}
@article{qwen2,
title = {Qwen2 Technical Report},
author = {An Yang and Baosong Yang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Zhou and Chengpeng Li and Chengyuan Li and Dayiheng Liu and Fei Huang and Guanting Dong and Haoran Wei and Huan Lin and Jialong Tang and Jialin Wang and Jian Yang and Jianhong Tu and Jianwei Zhang and Jianxin Ma and Jin Xu and Jingren Zhou and Jinze Bai and Jinzheng He and Junyang Lin and Kai Dang and Keming Lu and Keqin Chen and Kexin Yang and Mei Li and Mingfeng Xue and Na Ni and Pei Zhang and Peng Wang and Ru Peng and Rui Men and Ruize Gao and Runji Lin and Shijie Wang and Shuai Bai and Sinan Tan and Tianhang Zhu and Tianhao Li and Tianyu Liu and Wenbin Ge and Xiaodong Deng and Xiaohuan Zhou and Xingzhang Ren and Xinyu Zhang and Xipin Wei and Xuancheng Ren and Yang Fan and Yang Yao and Yichang Zhang and Yu Wan and Yunfei Chu and Yuqiong Liu and Zeyu Cui and Zhenru Zhang and Zhihao Fan},
journal = {arXiv preprint arXiv:2407.10671},
year = {2024}
}
If you are interested in leaving a message for either our research team or our product team, join our Discord or WeChat groups!