The open-source, datacenter-scale inference stack. Dynamo is the orchestration layer above inference engines — it doesn’t replace SGLang, TensorRT-LLM, or vLLM, it turns them into a coordinated multi-node inference system. Disaggregated serving, intelligent routing, multi-tier KV caching, and automatic scaling work together to maximize throughput and minimize latency for LLM, reasoning, multimodal, and video generation workloads.
Built in Rust for performance, Python for extensibility.
When to use Dynamo
- You’re serving LLMs across multiple GPUs or nodes and need to coordinate them
- You want KV-aware routing to avoid redundant prefill computation
- You need to independently scale prefill and decode (disaggregated serving)
- You want automatic scaling that meets latency SLAs at minimum total cost of ownership (TCO)
- You need fast cold-starts when spinning up new replicas
If you’re running a single model on a single GPU, your inference engine alone is probably sufficient.
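To make the KV-aware routing idea concrete, here is a toy sketch of prefix-cache-aware worker selection. It is not Dynamo's implementation (the real router is part of the Rust core); the block size, hashing scheme, and function names are all illustrative:

```python
from hashlib import sha256

BLOCK = 4  # tokens per KV block (illustrative; real systems use e.g. 16 or 64)

def block_hashes(tokens, block=BLOCK):
    """Chain-hash fixed-size prefix blocks so equal prefixes map to equal IDs."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        prev = sha256(prev + bytes(str(tokens[i:i + block]), "utf8")).digest()
        hashes.append(prev)
    return hashes

def pick_worker(tokens, worker_caches):
    """Prefer the worker holding the longest cached prefix; break ties toward the emptier cache."""
    prefix = block_hashes(tokens)
    def score(item):
        name, cached = item
        hit = 0
        for h in prefix:
            if h not in cached:
                break
            hit += 1
        return (hit, -len(cached))
    return max(worker_caches.items(), key=score)[0]

# Worker A has already served this prompt's first 8 tokens; worker B is cold.
prompt = list(range(12))
caches = {"A": set(block_hashes(prompt[:8])), "B": set()}
print(pick_worker(prompt, caches))  # -> A
```

Routing the request to A skips recomputing the prefill for the cached prefix, which is the redundant work the real router avoids.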
Most inference engines optimize a single GPU or a single node. Dynamo is the orchestration layer above them — it turns a cluster of GPUs into a coordinated inference system.
Canary health checks + in-flight request migration: workers fail; user requests don’t.
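In miniature, in-flight request migration looks like the following hypothetical sketch (not Dynamo's API): a streaming generation is resumed on a replacement worker from the tokens already produced, so the client never observes the failure:

```python
class WorkerFailure(Exception):
    pass

def flaky_worker(prompt, produced, fail_after=None):
    """Toy decode loop: yields one token at a time, optionally dying mid-stream."""
    answer = ["The", "sky", "is", "blue", "."]
    for i, tok in enumerate(answer[len(produced):]):
        if fail_after is not None and i >= fail_after:
            raise WorkerFailure("worker lost")
        yield tok

def generate_with_migration(prompt, workers):
    """Stream tokens; on failure, migrate to the next worker with the partial output."""
    produced = []
    for worker in workers:
        try:
            for tok in worker(prompt, produced):
                produced.append(tok)
            return produced
        except WorkerFailure:
            continue  # resume on the next worker, starting from `produced`
    raise RuntimeError("all workers failed")

workers = [
    lambda p, done: flaky_worker(p, done, fail_after=2),  # dies after 2 tokens
    lambda p, done: flaky_worker(p, done),                # healthy replacement
]
print(generate_with_migration("why is the sky blue?", workers))
# -> ['The', 'sky', 'is', 'blue', '.']
```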
New in 1.0
- Zero-config deploy (DGDR) (beta): Specify model, HW, and SLA in one YAML — AIConfigurator auto-profiles the workload, Planner optimizes the topology, and Dynamo deploys
- Agentic inference: Per-request hints for latency priority, expected output length, and cache pinning TTL. LangChain + NeMo Agent Toolkit integrations
- Multimodal E/P/D: Disaggregated encode/prefill/decode with embedding cache — 30% faster TTFT on image workloads
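Per-request hints ride along in the request body. The sketch below builds an OpenAI-style payload carrying such hints; the hints key and its field names are hypothetical placeholders, not Dynamo's documented schema:

```python
import json

def chat_request(prompt, *, latency_priority=None, expected_output_tokens=None, cache_pin_ttl_s=None):
    """Build an OpenAI-style chat payload with optional scheduling hints.

    The 'hints' key and its field names are illustrative, not a documented schema.
    """
    body = {
        "model": "my-model",
        "messages": [{"role": "user", "content": prompt}],
    }
    hints = {
        "latency_priority": latency_priority,
        "expected_output_tokens": expected_output_tokens,
        "cache_pin_ttl_s": cache_pin_ttl_s,
    }
    hints = {k: v for k, v in hints.items() if v is not None}
    if hints:
        body["hints"] = hints
    return json.dumps(body)

payload = chat_request("Summarize this doc", latency_priority="high", expected_output_tokens=128)
print(payload)
```

The value to the scheduler is that a request declaring a short expected output or a high latency priority can be placed differently from a long batch job; consult the Dynamo docs for the actual extension fields.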
Frontend OpenAPI Specification
The OpenAI-compatible frontend exposes an OpenAPI 3 spec at /openapi.json. To generate it without running the server:
cargo run -p dynamo-llm --bin generate-frontend-openapi
This writes to docs/reference/api/openapi.json.
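The generated spec is ordinary JSON, so it can be inspected with standard tooling. A small example, using a trimmed inline spec as a stand-in for docs/reference/api/openapi.json:

```python
import json

# Stand-in for the generated file; the real one lives at docs/reference/api/openapi.json.
spec = json.loads("""
{
  "openapi": "3.0.0",
  "info": {"title": "frontend", "version": "1.0"},
  "paths": {
    "/v1/chat/completions": {"post": {"summary": "chat"}},
    "/v1/models": {"get": {"summary": "models"}}
  }
}
""")

def list_routes(spec):
    """Flatten an OpenAPI document into 'METHOD path' strings."""
    return sorted(
        f"{method.upper()} {path}"
        for path, ops in spec["paths"].items()
        for method in ops
    )

print(list_routes(spec))  # -> ['GET /v1/models', 'POST /v1/chat/completions']
```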
Service Discovery and Messaging
Dynamo uses TCP for inter-component communication. On Kubernetes, native resources (CRDs + EndpointSlices) handle service discovery. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
| --- | --- | --- | --- |
| Local Development | ❌ Not required | ❌ Not required | Pass --discovery-backend file; vLLM also needs --kv-events-config '{"enable_kv_cache_events": false}' |
| Kubernetes | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
Note: KV-Aware Routing requires NATS for prefix caching coordination.
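A file discovery backend is conceptually simple: each worker publishes an endpoint record into a shared directory, and the frontend scans it. The mock below illustrates the mechanism only; it is not Dynamo's on-disk format:

```python
import json, tempfile
from pathlib import Path

def register(root, name, host, port):
    """Worker side: publish an endpoint record as a JSON file."""
    Path(root, f"{name}.json").write_text(json.dumps({"host": host, "port": port}))

def discover(root):
    """Frontend side: scan the shared directory for live endpoints."""
    return {p.stem: json.loads(p.read_text()) for p in Path(root).glob("*.json")}

with tempfile.TemporaryDirectory() as root:
    register(root, "decode-0", "10.0.0.5", 9001)
    register(root, "prefill-0", "10.0.0.6", 9002)
    endpoints = discover(root)

print(sorted(endpoints))  # -> ['decode-0', 'prefill-0']
```

This is why no etcd is needed locally: a shared filesystem already gives every process a consistent view of who is serving.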
For Slurm or other distributed deployments (and KV-aware routing), etcd and NATS are required; setup commands are given below.
Dynamo
Feature support at a glance:
Key Results
Architecture Deep Dive →
Core Capabilities
Quick Start
Option A: Container (fastest)
Also available: tensorrtllm-runtime:1.0.1 and vllm-runtime:1.0.1.
Option B: Install from PyPI
Install uv (curl -LsSf https://astral.sh/uv/install.sh | sh), then install Dynamo from PyPI and start the frontend and a worker as shown above. See the full installation guide for system dependencies and backend-specific notes.
Option C: Kubernetes (recommended)
For production multi-node clusters, install the Dynamo Platform and deploy with a single manifest:
Pre-built recipes for common models:
See recipes/ for the full list. Cloud-specific guides: AWS EKS · Google GKE
Building from Source
For contributors who want to build and develop locally. See the full build guide for details.
Community and Contributing
Dynamo is built in the open with an OSS-first development model. We welcome contributions of all kinds.
Latest News
Older news
Dynamo provides comprehensive benchmarking tools:
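Independent of Dynamo's own tooling, the two headline numbers, TTFT and throughput, are easy to measure against any streaming endpoint. A toy harness over a stubbed token stream:

```python
import time

def measure(stream):
    """Return (ttft_s, tokens_per_s) for an iterable of tokens."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time to first token
        count += 1
    total = time.perf_counter() - start
    return ttft, count / total if total > 0 else float("inf")

def stub_stream(n=50, delay=0.001):
    """Stand-in for a real model's streamed tokens."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = measure(stub_stream())
print(f"TTFT={ttft*1000:.1f} ms, throughput={tps:.0f} tok/s")
```

Swap the stub for an iterator over server-sent events from the frontend to benchmark a live deployment; Dynamo's bundled tools report these same metrics with far more rigor.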
For Slurm or other distributed deployments (and KV-aware routing), start etcd (./etcd) and NATS with JetStream (nats-server -js), or bring both up at once with:
docker compose -f deploy/docker-compose.yml up -d
More News
Reference