NVIDIA DYNAMO FAQ

Q: What is NVIDIA Dynamo?

NVIDIA Dynamo is an open source, modular inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. It enables seamless scaling of inference workloads across GPU nodes and dynamically scales GPU workers up and down to address traffic bottlenecks. NVIDIA Dynamo also offers LLM-specific capabilities such as Disaggregated Serving, KV cache aware routing, KV cache offloading across multiple memory hierarchies, and low-latency data transfer between nodes.
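To make the disaggregated serving idea concrete, here is a minimal Python sketch that separates the compute-heavy prefill phase from the latency-sensitive decode phase, with the KV cache handed off in between. Every name in it is a hypothetical illustration of the concept, not Dynamo's actual API.

    # Hypothetical sketch of disaggregated serving: prefill and decode run on
    # separate worker pools, with the KV cache handed off between them.
    from dataclasses import dataclass

    @dataclass
    class KVCache:
        tokens: list[str]   # prompt tokens whose attention state is cached
        blocks: bytes       # stand-in for the actual GPU KV blocks

    def prefill_worker(prompt: str) -> KVCache:
        """Compute-bound phase: process the whole prompt once, build the KV cache."""
        tokens = prompt.split()
        return KVCache(tokens=tokens, blocks=b"\x00" * len(tokens))

    def decode_worker(cache: KVCache, max_new_tokens: int) -> list[str]:
        """Latency-bound phase: generate one token at a time, reusing the cache."""
        return [f"<tok{i}>" for i in range(max_new_tokens)]  # placeholder sampling

    # The two phases can now be scaled and scheduled independently; in a real
    # deployment the KVCache hand-off is a GPU-to-GPU transfer (e.g. via NIXL).
    cache = prefill_worker("Why is the sky blue?")
    print(decode_worker(cache, max_new_tokens=3))

Separating the two phases lets each be provisioned for its own bottleneck: prefill for compute throughput, decode for memory bandwidth and latency.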

Q: What are the benefits of Dynamo?

Dynamo enables AI service providers to serve more inference requests per GPU, accelerate inference response times, reduce overall inference costs, and shorten the time to market for deploying new AI models in production.

Q: What are the key innovations of NVIDIA Dynamo?

Dynamo introduces several key innovations including:

  • GPU Planner: A planning engine that dynamically adds and removes GPUs to adjust to fluctuating user demand, avoiding GPU over- or under-provisioning.
  • Smart Router: An LLM-aware router that directs requests across large GPU fleets to minimize costly GPU re-computation of repeated or overlapping requests (see the sketch after this list).
  • Low-Latency Communication Library (NIXL): An inference-optimized library that supports state-of-the-art GPU-to-GPU communication and abstracts the complexity of data exchange across heterogeneous devices, accelerating data transfer.
  • Memory Manager: An engine that intelligently offloads and reloads inference data to and from lower-cost memory and storage devices without impacting user experience.
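As a rough illustration of what the Smart Router's KV-cache-aware routing does, the sketch below scores each worker by how long a cached token prefix it shares with an incoming request and routes to the best match, so the shared prefix need not be recomputed. The function and worker names are invented for illustration; Dynamo's real router also weighs factors such as current worker load.

    # Hypothetical sketch of KV-cache-aware routing, not Dynamo's router API.

    def prefix_overlap(cached: list[str], request: list[str]) -> int:
        """Length of the common token prefix between a cached entry and a request."""
        n = 0
        for a, b in zip(cached, request):
            if a != b:
                break
            n += 1
        return n

    def route(request: list[str], worker_caches: dict[str, list[list[str]]]) -> str:
        """Pick the worker holding the longest cached prefix for this request."""
        def best(caches: list[list[str]]) -> int:
            return max((prefix_overlap(c, request) for c in caches), default=0)
        return max(worker_caches, key=lambda w: best(worker_caches[w]))

    workers = {
        "gpu-0": [["system", "prompt", "v1", "hello"]],
        "gpu-1": [["system", "prompt", "v2", "weather"]],
    }
    print(route(["system", "prompt", "v1", "how", "are", "you"], workers))  # gpu-0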

Q: What frameworks are supported by NVIDIA Dynamo?

Dynamo supports all major LLM frameworks, including vLLM, SGLang, and TensorRT-LLM.

Q: Who is the target audience for NVIDIA Dynamo?

Developers, AI service providers, and enterprises deploying large-scale AI clusters using RAG, reasoning, multi-turn, and/or agentic inference.

Q: Where can customers get Dynamo?

NVIDIA Dynamo is open source and will be available on GitHub starting around March 18, 2025. It will also be included in NVIDIA NIM.

Q: Is Dynamo replacing anything?

Dynamo is the successor to Triton, building on its success with a new modular architecture designed to serve generative AI models in multi-node distributed environments.

Q: What if my customer has an existing inference stack and is only interested in certain components of Dynamo?

Dynamo has a modular architecture that allows customers to preserve investments in their existing inference stack and choose the Dynamo components that best meet their IT and business requirements.

Q: How much performance gain should customers expect from running their workloads with NVIDIA Dynamo?

Performance gains depend heavily on the deployment SLA and the target model. In the case of Llama 70B, Dynamo delivered 2.3x and 2.6x higher throughput at fixed latency on the H100 and B200 platforms, respectively.

Q: Are there any NVIDIA partners that have endorsed Dynamo and expressed interest in deploying?

Cohere, Together AI, and Perplexity AI have all endorsed Dynamo and shared their interest in leveraging its capabilities.

Q: What is NVIDIA NIXL?

NIXL is an asynchronous accelerated communications library that can rapidly transfer data between different types of memory and storage. It can use different transport types and network connections, including NVLink, PCIe, InfiniBand, and Spectrum-X. NIXL works as part of NVIDIA Dynamo, optimizing the data movement that Dynamo orchestrates.
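The core pattern NIXL provides is asynchronous transfer: a copy is posted, proceeds in the background over some transport, and the caller synchronizes only when the data is needed. The sketch below mimics that pattern in plain Python with a thread pool; it deliberately does not use NIXL's real API, and all names are illustrative.

    # Conceptual sketch of an async, transport-agnostic transfer handle.
    from concurrent.futures import Future, ThreadPoolExecutor

    TRANSPORTS = {"nvlink", "pcie", "infiniband", "spectrum-x"}  # per the FAQ
    _pool = ThreadPoolExecutor(max_workers=4)

    def _do_copy(src: bytearray, dst: bytearray, transport: str) -> int:
        """Stand-in for a DMA over the chosen transport; returns bytes moved."""
        dst[: len(src)] = src
        return len(src)

    def post_transfer(src: bytearray, dst: bytearray, transport: str) -> Future:
        """Start an asynchronous copy and return a handle the caller can wait on."""
        assert transport in TRANSPORTS, f"unknown transport: {transport}"
        return _pool.submit(_do_copy, src, dst, transport)

    src = bytearray(b"kv-cache-block")
    dst = bytearray(len(src))
    handle = post_transfer(src, dst, "nvlink")  # returns immediately
    print(handle.result(), bytes(dst))          # block only when the data is needed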

Q: How can I learn more about NVIDIA Dynamo?