
Nano-vLLM

A lightweight vLLM implementation built from scratch.

Key Features

  • 🚀 Fast offline inference - Comparable inference speeds to vLLM
  • 📖 Readable codebase - Clean implementation in ~1,200 lines of Python code
  • ⚡ Optimization Suite - Prefix caching, Tensor Parallelism, Torch compilation, CUDA graph, etc. (see the configuration sketch below)
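
These optimizations are driven by the constructor flags shown in the Quick Start below. A minimal configuration sketch; the semantics of enforce_eager=False (allowing CUDA graph capture and compilation) are an assumption mirroring vLLM's flag of the same name:

from nanovllm import LLM

# Sketch only: enforce_eager=False is assumed to enable CUDA graph capture /
# Torch compilation, while tensor_parallel_size shards the weights across GPUs.
llm = LLM(
    "/YOUR/MODEL/PATH",
    enforce_eager=False,
    tensor_parallel_size=2,
)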

Installation

pip install git+https://github.com/GeeeekExplorer/nano-vllm.git

Model Download

To download the model weights manually, use the following command:

huggingface-cli download --resume-download Qwen/Qwen3-0.6B \
  --local-dir ~/huggingface/Qwen3-0.6B/ \
  --local-dir-use-symlinks False
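
Alternatively, the weights can be fetched programmatically; a sketch using the huggingface_hub Python API (the library behind the CLI above, not part of this repository):

import os
from huggingface_hub import snapshot_download

# Download the Qwen3-0.6B weights into the same local directory (resumable by default).
snapshot_download(
    repo_id="Qwen/Qwen3-0.6B",
    local_dir=os.path.expanduser("~/huggingface/Qwen3-0.6B/"),
)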

Quick Start

See example.py for usage. The API mirrors vLLM’s interface with minor differences in the LLM.generate method:

from nanovllm import LLM, SamplingParams

# Load the model; enforce_eager=True runs in plain eager mode (no CUDA graph capture).
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=True, tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
prompts = ["Hello, Nano-vLLM."]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])
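
example.py additionally formats the prompt with the model's chat template before calling generate. A minimal sketch of that pattern, assuming a transformers-compatible tokenizer is shipped alongside the model weights:

from transformers import AutoTokenizer
from nanovllm import LLM, SamplingParams

path = "/YOUR/MODEL/PATH"
tokenizer = AutoTokenizer.from_pretrained(path)
llm = LLM(path, enforce_eager=True, tensor_parallel_size=1)

sampling_params = SamplingParams(temperature=0.6, max_tokens=256)
# Wrap the raw request in the model's chat format so instruct-tuned models answer properly.
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": "Introduce yourself briefly."}],
        tokenize=False,
        add_generation_prompt=True,
    )
]
outputs = llm.generate(prompts, sampling_params)
print(outputs[0]["text"])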

Benchmark

See bench.py for the benchmark script.

Test Configuration:

  • Hardware: RTX 4070 Laptop (8GB)
  • Model: Qwen3-0.6B
  • Total Requests: 256 sequences
  • Input Length: Randomly sampled between 100–1024 tokens
  • Output Length: Randomly sampled between 100–1024 tokens

Performance Results:

| Inference Engine | Output Tokens | Time (s) | Throughput (tokens/s) |
|------------------|---------------|----------|-----------------------|
| vLLM             | 133,966       | 98.37    | 1361.84               |
| Nano-vLLM        | 133,966       | 93.41    | 1434.13               |
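
The sketch below mirrors that configuration for reproduction purposes. It is hypothetical (bench.py is the authoritative script) and assumes that generate accepts one SamplingParams per request, as vLLM does, and that each output dict also carries the generated token_ids:

import time
from random import randint, seed

from nanovllm import LLM, SamplingParams

seed(0)
NUM_SEQS = 256
llm = LLM("/YOUR/MODEL/PATH", enforce_eager=False, tensor_parallel_size=1)

# 256 requests with input and output lengths drawn from [100, 1024] tokens;
# inputs are crude word-repetition prompts, outputs are capped per request via max_tokens.
prompts = ["hello " * randint(100, 1024) for _ in range(NUM_SEQS)]
sampling_params = [
    SamplingParams(temperature=0.6, max_tokens=randint(100, 1024))
    for _ in range(NUM_SEQS)
]

start = time.time()
outputs = llm.generate(prompts, sampling_params)
elapsed = time.time() - start

total_output_tokens = sum(len(o["token_ids"]) for o in outputs)
print(f"{total_output_tokens} output tokens in {elapsed:.2f}s "
      f"({total_output_tokens / elapsed:.2f} tokens/s)")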

Star History

Star History Chart
