Add LM Studio md file and add link into navigation (#1800)
Add LM Studio md file and add link into navigation
Minor phrasing revisions
Update lmstudio.md
Co-authored-by: Ren Xuancheng jklj077@users.noreply.github.com
版权所有:中国计算机学会技术支持:开源发展技术委员会
京ICP备13000930号-9
京公网安备 11010802047560号
Qwen3
💜 Qwen Chat | 🤗 Hugging Face | 🤖 ModelScope | 📑 Paper | 📑 Blog | 📖 Documentation
🖥️ Demo | 💬 WeChat (微信) | 🫨 Discord
Visit our Hugging Face or ModelScope organization (click links above), search checkpoints with names starting with
Qwen3-or visit the Qwen3 collection, and you will find all you need! Enjoy!To learn more about Qwen3, feel free to read our documentation [EN|ZH]. Our documentation consists of the following sections:
Introduction
Qwen3-2507
Over the past three months, we continued to explore the potential of the Qwen3 families and we are excited to introduce the updated Qwen3-2507 in two variants, Qwen3-Instruct-2507 and Qwen3-Thinking-2507, and three sizes, 235B-A22B, 30B-A3B, and 4B.
Qwen3-Instruct-2507 is the updated version of the previous Qwen3 non-thinking mode, featuring the following key enhancements:
Qwen3-Thinking-2507 is the continuation of Qwen3 thinking model, with improved quality and depth of reasoning, featuring the following key enhancements:
Previous Qwen3 Release
Qwen3 (aka Qwen3-2504)
We are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. These models represent our most advanced and intelligent systems to date, improving from our experience in building QwQ and Qwen2.5. We are making the weights of Qwen3 available to the public, including both dense and Mixture-of-Expert (MoE) models.
The highlights from Qwen3 include:
News
Performance
Detailed evaluation results are reported in this 📑 blog (Qwen3-2504) and this 📑 blog (Qwen3-2507) [coming soon].
For requirements on GPU memory and the respective throughput, see results here.
Run Qwen3
🤗 Transformers
Transformers is a library of pretrained natural language processing for inference and training. The latest version of
transformersis recommended andtransformers>=4.51.0is required.Qwen3-Instruct-2507
The following contains a code snippet illustrating how to use Qwen3-30B-A3B-Instruct-2507 to generate content based on given inputs.
Qwen3-Thinking-2507
The following contains a code snippet illustrating how to use Qwen3-30B-A3B-Thinking-2507 to generate content based on given inputs.
Switching Thinking/Non-thinking Modes for Previous Qwen3 Models
By default, Qwen3 models will think before response. This could be controlled by
enable_thinking=False: Passingenable_thinking=Falseto `tokenizer.apply_chat_template` will strictly prevent the model from generating thinking content./thinkand/no_thinkinstructions: Use those words in the system or user message to signify whether Qwen3 should think. In multi-turn conversations, the latest instruction is followed.ModelScope
We strongly advise users especially those in mainland China to use ModelScope. ModelScope adopts a Python API similar to Transformers. The CLI tool
modelscope downloadcan help you solve issues concerning downloading checkpoints. For vLLM and SGLang, the environment variableVLLM_USE_MODELSCOPE=trueandSGLANG_USE_MODELSCOPE=truecan be used respectively.llama.cpp
llama.cppenables LLM inference with minimal setup and state-of-the-art performance on a wide range of hardware.llama.cpp>=b5401is recommended for the full support of Qwen3.To use the CLI, run the following in a terminal:
To use the API server, run the following in a terminal:
A simple web front end will be at
http://localhost:8080and an OpenAI-compatible API will be athttp://localhost:8080/v1.For additional guides, please refer to our documentation.
Ollama
After installing Ollama, you can initiate the Ollama service with the following command (Ollama v0.9.0 or higher is recommended):
To pull a model checkpoint and run the model, use the
ollama runcommand. You can specify a model size by adding a suffix toqwen3, such as:8bor:30b-a3b:You can also access the Ollama service via its OpenAI-compatible API. Please note that you need to (1) keep
ollama serverunning while using the API, and (2) executeollama run qwen3:8bbefore utilizing this API to ensure that the model checkpoint is prepared. The API is athttp://localhost:11434/v1/by default.For additional details, please visit ollama.ai.
LMStudio
Qwen3 has already been supported by lmstudio.ai. You can directly use LMStudio with our GGUF files.
ExecuTorch
To export and run on ExecuTorch (iOS, Android, Mac, Linux, and more), please follow this example.
MNN
To export and run on MNN, which supports Qwen3 on mobile devices, please visit Alibaba MNN.
MLX LM
If you are running on Apple Silicon,
mlx-lmalso supports Qwen3 (mlx-lm>=0.24.0). Look for models ending with MLX on Hugging Face Hub.OpenVINO
If you are running on Intel CPU or GPU, OpenVINO toolkit supports Qwen3. You can follow this chatbot example.
Deploy Qwen3
Qwen3 is supported by multiple inference frameworks. Here we demonstrate the usage of
SGLang,vLLMandTensorRT-LLM. You can also find Qwen3 models from various inference providers, e.g., Alibaba Cloud Model Studio.SGLang
SGLang is a fast serving framework for large language models and vision language models. SGLang could be used to launch a server with OpenAI-compatible API service.
sglang>=0.4.6.post1is required.For Qwen3-Instruct-2507,
For Qwen3-Thinking-2507,
For Qwen3, it is
An OpenAI-compatible API will be available at
http://localhost:30000/v1.vLLM
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.
vllm>=0.9.0is recommended.For Qwen3-Instruct-2507,
For Qwen3-Thinking-2507,
For Qwen3, it is
An OpenAI-compatible API will be available at
http://localhost:8000/v1.TensorRT-LLM
TensorRT-LLM is an open-source LLM inference engine from NVIDIA, which provides optimizations including custom attention kernels, quantization and more on NVIDIA GPUs. Qwen3 is supported in its re-architected PyTorch backend.
tensorrt_llm>=0.20.0rc3is recommended. Please refer to the README page for more details.An OpenAI-compatible API will be available at
http://localhost:8000/v1.MindIE
For deployment on Ascend NPUs, please visit Modelers and search for Qwen3.
Build with Qwen3
Tool Use
For tool use capabilities, we recommend taking a look at Qwen-Agent, which provides a wrapper around these APIs to support tool use or function calling with MCP support. Tool use with Qwen3 can also be conducted with SGLang, vLLM, Transformers, llama.cpp, Ollama, etc. Follow guides in our documentation to see how to enable the support.
Finetuning
We advise you to use training frameworks, including Axolotl, UnSloth, Swift, Llama-Factory, etc., to finetune your models with SFT, DPO, GRPO, etc.
License Agreement
All our open-weight models are licensed under Apache 2.0. You can find the license files in the respective Hugging Face repositories.
Citation
If you find our work helpful, feel free to give us a cite.
Contact Us
If you are interested to leave a message to either our research team or product team, join our Discord or WeChat groups!