NVIDIA Nemotron Elastic

Has anybody seen this recent paper from NVIDIA? Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs (arXiv 2511.16664)

There’s a model up on HF: nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16 (FP8 and NVFP4 quants are also available). Each checkpoint contains a 30B, a 23B, and a 13B model in the same package, and the entire package is only the size of the largest model. Kinda reminds me of those Doublemint commercials: “it’s three, three, three models in one!”

It seems there are two aspects to this that are interesting:

  1. The ability to “slice” a model checkpoint from the base, and run it, and
  2. The as-yet-unmet promise of running the full model and having the engine choose which variant to use during different phases of inference, via something they call “Elastic Budget Control”.
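
To make the “slicing” idea concrete for myself, here is a toy sketch in plain Python. It assumes a smaller variant is just a sub-block of a larger weight matrix, which is a simplification on my part and may not match how the real checkpoint selects or lays out its weights. The names `slice_linear` and `matvec` are made up for illustration.

```python
# Toy illustration only: names and sizes are hypothetical, and the real
# checkpoint layout is whatever the paper and model card describe.

def slice_linear(weight, out_dim, in_dim):
    """Return the top-left out_dim x in_dim block of a weight matrix."""
    return [row[:in_dim] for row in weight[:out_dim]]


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# "Full" layer: 4 outputs, 4 inputs.
full = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# Nested "small" layer: the same numbers as the full layer, just fewer of them.
small = slice_linear(full, 2, 3)

print(small)                     # [[1, 2, 3], [5, 6, 7]]
print(matvec(small, [1, 1, 1]))  # [6, 18]
```

The point is only that the small layer stores nothing of its own, which is why the whole package can be the size of the largest model.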

I guess if you’re bandwidth-limited, only having to download a single model is interesting. Might also be useful in a llama-swap context.

However, the second one is the most interesting, especially in our resource-limited world. If we can run a single model in a hybrid mode, we may see significant impacts on real-world performance in a variety of applications. From the model card:

⚠️ Note on inference support. Elastic budget control is not yet supported in the standard vLLM inference engine — switching nested sub-models within a single generation (e.g., 23B → 30B think → answer) currently requires a custom inference path. Nested models preserve the Mamba and attention layer structure, enabling cache-state transplantation between models, and efficient native vLLM integration is actively being worked on.
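
If I’m reading that example right, the smaller variant handles the thinking phase and the larger one writes the answer. Here is a rough sketch of what that per-phase choice might look like. Everything in it is my own assumption: the `</think>` marker, the phase-to-variant mapping, and the `variant_schedule` helper are hypothetical and not part of vLLM or the model’s actual API.

```python
# Hypothetical sketch of per-phase variant selection. Nothing here is a real
# vLLM or Nemotron API; the mapping and the end-of-think marker are assumed.

PHASE_TO_VARIANT = {"think": "23B", "answer": "30B"}


def variant_schedule(tokens, end_think="</think>"):
    """Pair each token with the variant that would generate it."""
    phase = "think"
    schedule = []
    for tok in tokens:
        schedule.append((tok, PHASE_TO_VARIANT[phase]))
        # Switch to the answer variant once the thinking block is closed.
        if tok == end_think:
            phase = "answer"
    return schedule


for tok, variant in variant_schedule(["hmm", "so", "</think>", "42"]):
    print(f"{variant}: {tok}")
```

This leaves out the hard part, which is carrying the Mamba and attention cache state across the switch.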

Looking forward to seeing this supported in vLLM. For those who know far more than me, what might that take?