The open-source, datacenter-scale inference stack. Dynamo is the orchestration layer above inference engines — it doesn’t replace SGLang, TensorRT-LLM, or vLLM, it turns them into a coordinated multi-node inference system. Disaggregated serving, intelligent routing, multi-tier KV caching, and automatic scaling work together to maximize throughput and minimize latency for LLM, reasoning, multimodal, and video generation workloads.
Built in Rust for performance, Python for extensibility.
When to use Dynamo
- You’re serving LLMs across multiple GPUs or nodes and need to coordinate them
- You want KV-aware routing to avoid redundant prefill computation
- You need to independently scale prefill and decode (disaggregated serving)
- You want automatic scaling that meets latency SLAs at minimum total cost of ownership (TCO)
- You need fast cold-starts when spinning up new replicas
If you’re running a single model on a single GPU, your inference engine alone is probably sufficient.
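To make the KV-aware routing idea concrete, here is a toy sketch of prefix-cache-aware worker selection. It is not Dynamo's implementation (the real router is part of the Rust core); the block size, hashing scheme, and function names are all illustrative:

```python
from hashlib import sha256

BLOCK = 4  # tokens per KV block (illustrative; real systems use e.g. 16 or 64)

def block_hashes(tokens, block=BLOCK):
    """Chain-hash fixed-size prefix blocks so equal prefixes map to equal IDs."""
    hashes, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % block, block):
        prev = sha256(prev + bytes(str(tokens[i:i + block]), "utf8")).digest()
        hashes.append(prev)
    return hashes

def pick_worker(tokens, worker_caches):
    """Prefer the worker holding the longest cached prefix; break ties toward the emptier cache."""
    prefix = block_hashes(tokens)
    def score(item):
        name, cached = item
        hit = 0
        for h in prefix:
            if h not in cached:
                break
            hit += 1
        return (hit, -len(cached))
    return max(worker_caches.items(), key=score)[0]

# Worker A has already served this prompt's first 8 tokens; worker B is cold.
prompt = list(range(12))
caches = {"A": set(block_hashes(prompt[:8])), "B": set()}
print(pick_worker(prompt, caches))  # -> A
```

Routing the request to A skips recomputing the prefill for the cached prefix, which is the redundant work the real router avoids.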
Most inference engines optimize a single GPU or a single node. Dynamo is the orchestration layer above them — it turns a cluster of GPUs into a coordinated inference system.
Canary health checks + in-flight request migration: workers fail; user requests don’t.
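In miniature, in-flight request migration looks like the following hypothetical sketch (not Dynamo's API): a streaming generation is resumed on a replacement worker from the tokens already produced, so the client never observes the failure:

```python
class WorkerFailure(Exception):
    pass

def flaky_worker(prompt, produced, fail_after=None):
    """Toy decode loop: yields one token at a time, optionally dying mid-stream."""
    answer = ["The", "sky", "is", "blue", "."]
    for i, tok in enumerate(answer[len(produced):]):
        if fail_after is not None and i >= fail_after:
            raise WorkerFailure("worker lost")
        yield tok

def generate_with_migration(prompt, workers):
    """Stream tokens; on failure, migrate to the next worker with the partial output."""
    produced = []
    for worker in workers:
        try:
            for tok in worker(prompt, produced):
                produced.append(tok)
            return produced
        except WorkerFailure:
            continue  # resume on the next worker, starting from `produced`
    raise RuntimeError("all workers failed")

workers = [
    lambda p, done: flaky_worker(p, done, fail_after=2),  # dies after 2 tokens
    lambda p, done: flaky_worker(p, done),                # healthy replacement
]
print(generate_with_migration("why is the sky blue?", workers))
# -> ['The', 'sky', 'is', 'blue', '.']
```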
New in 1.0
- Zero-config deploy (DGDR) (beta): Specify model, HW, and SLA in one YAML — AIConfigurator auto-profiles the workload, Planner optimizes the topology, and Dynamo deploys
- Agentic inference: Per-request hints for latency priority, expected output length, and cache pinning TTL. LangChain + NeMo Agent Toolkit integrations
- Multimodal E/P/D: Disaggregated encode/prefill/decode with embedding cache — 30% faster TTFT on image workloads
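Per-request hints ride along in the request body. The sketch below builds an OpenAI-style payload carrying such hints; the hints key and its field names are hypothetical placeholders, not Dynamo's documented schema:

```python
import json

def chat_request(prompt, *, latency_priority=None, expected_output_tokens=None, cache_pin_ttl_s=None):
    """Build an OpenAI-style chat payload with optional scheduling hints.

    The 'hints' key and its field names are illustrative, not a documented schema.
    """
    body = {
        "model": "my-model",
        "messages": [{"role": "user", "content": prompt}],
    }
    hints = {
        "latency_priority": latency_priority,
        "expected_output_tokens": expected_output_tokens,
        "cache_pin_ttl_s": cache_pin_ttl_s,
    }
    hints = {k: v for k, v in hints.items() if v is not None}
    if hints:
        body["hints"] = hints
    return json.dumps(body)

payload = chat_request("Summarize this doc", latency_priority="high", expected_output_tokens=128)
print(payload)
```

The value to the scheduler is that a request declaring a short expected output or a high latency priority can be placed differently from a long batch job; consult the Dynamo docs for the actual extension fields.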
Frontend OpenAPI Specification
The OpenAI-compatible frontend exposes an OpenAPI 3 spec at /openapi.json. To generate it without running the server:
cargo run -p dynamo-llm --bin generate-frontend-openapi
This writes to docs/reference/api/openapi.json.
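The generated spec is ordinary JSON, so it can be inspected with standard tooling. A small example, using a trimmed inline spec as a stand-in for docs/reference/api/openapi.json:

```python
import json

# Stand-in for the generated file; the real one lives at docs/reference/api/openapi.json.
spec = json.loads("""
{
  "openapi": "3.0.0",
  "info": {"title": "frontend", "version": "1.0"},
  "paths": {
    "/v1/chat/completions": {"post": {"summary": "chat"}},
    "/v1/models": {"get": {"summary": "models"}}
  }
}
""")

def list_routes(spec):
    """Flatten an OpenAPI document into 'METHOD path' strings."""
    return sorted(
        f"{method.upper()} {path}"
        for path, ops in spec["paths"].items()
        for method in ops
    )

print(list_routes(spec))  # -> ['GET /v1/models', 'POST /v1/chat/completions']
```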
Service Discovery and Messaging
Dynamo uses TCP for inter-component communication. On Kubernetes, native resources (CRDs + EndpointSlices) handle service discovery. External services are optional for most deployments:
| Deployment | etcd | NATS | Notes |
| --- | --- | --- | --- |
| Local Development | ❌ Not required | ❌ Not required | Pass --discovery-backend file; vLLM also needs --kv-events-config '{"enable_kv_cache_events": false}' |
| Kubernetes | ❌ Not required | ❌ Not required | K8s-native discovery; TCP request plane |
Note: KV-Aware Routing requires NATS for prefix caching coordination.
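A file discovery backend is conceptually simple: each worker publishes an endpoint record into a shared directory, and the frontend scans it. The mock below illustrates the mechanism only; it is not Dynamo's on-disk format:

```python
import json, tempfile
from pathlib import Path

def register(root, name, host, port):
    """Worker side: publish an endpoint record as a JSON file."""
    Path(root, f"{name}.json").write_text(json.dumps({"host": host, "port": port}))

def discover(root):
    """Frontend side: scan the shared directory for live endpoints."""
    return {p.stem: json.loads(p.read_text()) for p in Path(root).glob("*.json")}

with tempfile.TemporaryDirectory() as root:
    register(root, "decode-0", "10.0.0.5", 9001)
    register(root, "prefill-0", "10.0.0.6", 9002)
    endpoints = discover(root)

print(sorted(endpoints))  # -> ['decode-0', 'prefill-0']
```

This is why no etcd is needed locally: a shared filesystem already gives every process a consistent view of who is serving.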
For Slurm or other distributed deployments (and KV-aware routing), etcd and NATS are required; setup commands are given below.
Dynamo
Feature support at a glance:
Key Results
Architecture Deep Dive →
Core Capabilities
Quick Start
Option A: Container (fastest)
Also available: tensorrtllm-runtime:1.0.1 and vllm-runtime:1.0.1.
Option B: Install from PyPI
Install uv (curl -LsSf https://astral.sh/uv/install.sh | sh), then install Dynamo from PyPI and start the frontend and a worker as shown above. See the full installation guide for system dependencies and backend-specific notes.
Option C: Kubernetes (recommended)
For production multi-node clusters, install the Dynamo Platform and deploy with a single manifest:
Pre-built recipes for common models:
See recipes/ for the full list. Cloud-specific guides: AWS EKS · Google GKE
Building from Source
For contributors who want to build and develop locally. See the full build guide for details.
Community and Contributing
Dynamo is built in the open with an OSS-first development model. We welcome contributions of all kinds.
Latest News
Older news
Dynamo provides comprehensive benchmarking tools:
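Independent of Dynamo's own tooling, the two headline numbers, TTFT and throughput, are easy to measure against any streaming endpoint. A toy harness over a stubbed token stream:

```python
import time

def measure(stream):
    """Return (ttft_s, tokens_per_s) for an iterable of tokens."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time to first token
        count += 1
    total = time.perf_counter() - start
    return ttft, count / total if total > 0 else float("inf")

def stub_stream(n=50, delay=0.001):
    """Stand-in for a real model's streamed tokens."""
    for _ in range(n):
        time.sleep(delay)
        yield "tok"

ttft, tps = measure(stub_stream())
print(f"TTFT={ttft*1000:.1f} ms, throughput={tps:.0f} tok/s")
```

Swap the stub for an iterator over server-sent events from the frontend to benchmark a live deployment; Dynamo's bundled tools report these same metrics with far more rigor.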
For Slurm or other distributed deployments (and KV-aware routing), start etcd (./etcd) and NATS with JetStream (nats-server -js), or bring both up at once with:
docker compose -f deploy/docker-compose.yml up -d
More News
Reference