NVIDIA Nemotron Elastic

chick_webb · May 14, 2026, 6:16pm

Has anybody seen this recent paper from NVIDIA? Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs (arXiv:2511.16664)

There’s a model up on HF: nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16 (FP8 and NVFP4 quants are also available). Each checkpoint packs a 30B, a 23B, and a 13B model into the same package, and the entire package is only the size of the largest model. Kinda reminds me of those Doublemint commercials - “it’s three, three, three models in one!”

It seems there are two aspects to this that are interesting:

1. The ability to “slice” a smaller model checkpoint out of the base and run it on its own, and
2. The as-yet-unmet promise of being able to run the full model and have the engine choose which variant to use during different phases of inference, via something they call “Elastic Budget Control”.
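To make the first point concrete for myself, here is a toy sketch of what "slicing" could mean. It is not the real checkpoint format or NVIDIA's tooling. It assumes each nested variant simply keeps a subset of the parent's units (rows of a single weight matrix here), and the names `slice_rows`, `parent`, and `variants` are all made up for illustration.

```python
# Toy illustration only: not the real checkpoint format or NVIDIA's slicing tool.
# Assumption: each nested variant keeps a subset of the parent's units
# (here, rows of one weight matrix), selected by a stored index list.

def slice_rows(weight, keep):
    """Return the sub-matrix made of the rows listed in `keep`."""
    return [list(weight[i]) for i in keep]

# Hypothetical parent layer with 6 hidden units and 2 inputs.
parent = [[float(10 * r + c) for c in range(2)] for r in range(6)]

# Hypothetical nested index sets: each smaller variant is a subset of the next.
variants = {
    "large": [0, 1, 2, 3, 4, 5],
    "medium": [0, 1, 2, 3],
    "small": [0, 1],
}

sliced = {name: slice_rows(parent, keep) for name, keep in variants.items()}

# Only the parent is stored; every variant is derived from it on demand.
stored_params = sum(len(row) for row in parent)
variant_params = {name: sum(len(r) for r in w) for name, w in sliced.items()}
print(stored_params, variant_params)
```

The point of the toy is just the storage argument: if the smaller variants are subsets of the largest one, the package never needs to be bigger than the largest model.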

I guess if you’re bandwidth-limited, only having to download a single model is appealing. It might also be useful in a llama-swap context.

However, the second one is the most interesting, especially in our resource-limited world. If we can run a single model in a hybrid mode, we may see significant impacts on real-world performance in a variety of applications. From the model card:

⚠️ Note on inference support. Elastic budget control is not yet supported in the standard vLLM inference engine — switching nested sub-models within a single generation (e.g., 23B → 30B think → answer) currently requires a custom inference path. Nested models preserve the Mamba and attention layer structure, enabling cache-state transplantation between models, and efficient native vLLM integration is actively being worked on.
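As I read that note, the control flow would look roughly like the toy below: one variant decodes the "think" phase, another decodes the "answer" phase, and the cache state is carried across the switch instead of being rebuilt. Everything here is hypothetical (`Cache`, `fake_step`, `generate`, the budgets). It does not use any real vLLM API, and the "model" just emits labelled placeholder tokens.

```python
# Toy control flow only: hypothetical names, no real inference engine or vLLM API.
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Stand-in for the Mamba/attention state carried across the switch."""
    tokens: list = field(default_factory=list)


def fake_step(variant, cache):
    """Pretend to decode one token with the named variant and extend the cache."""
    token = f"{variant}:{len(cache.tokens)}"
    cache.tokens.append(token)
    return token


def generate(think_variant, answer_variant, think_budget, answer_budget):
    cache = Cache()
    for _ in range(think_budget):
        fake_step(think_variant, cache)
    # The switch: same cache object, different sub-model, no re-prefill.
    for _ in range(answer_budget):
        fake_step(answer_variant, cache)
    return cache.tokens


out = generate("23B", "30B", think_budget=3, answer_budget=2)
print(out)  # ['23B:0', '23B:1', '23B:2', '30B:3', '30B:4']
```

The hard part for a real engine is presumably what the toy glosses over: making the second variant accept state produced by the first one.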

Looking forward to seeing this supported in vLLM. For those who know far more than me, what might that take?
