Has anybody seen this recent paper from NVIDIA? Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs [2511.16664]
There’s a model up on HF: nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16 (FP8 and NVFP4 quants are also available). Each checkpoint holds a 30B, 23B, and 13B model in the same package, and the entire package is the size of the largest model. Kinda reminds me of those Doublemint commercials - “it’s three, three, three models in one!”
It seems there are two aspects to this that are interesting:
- The ability to “slice” a smaller model checkpoint out of the base and run it on its own, and
- The as-yet-unmet promise of being able to run the full model and have the engine choose which variant to use during different phases of inference, via something they call “Elastic Budget Control”.
On the first one: I guess if you’re bandwidth-limited, only having to download a single model is interesting. It might also be useful in a llama-swap context.
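To make the “slicing” idea concrete, here is a toy sketch of what nesting could look like for a single MLP layer. It assumes the smaller variant simply uses the first k hidden units of the full layer; that is my simplification for illustration, not necessarily how the real checkpoint is laid out, and all the names here (`slice_mlp`, `mlp`, the weights) are made up.

```python
# Toy illustration only; the real elastic checkpoint is far more involved.
# Assumption: the sub-model uses the first k hidden units of the full layer.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def slice_mlp(W_in, W_out, k):
    """Keep the first k hidden units: first k rows of W_in, first k columns of W_out."""
    return W_in[:k], [row[:k] for row in W_out]

def mlp(W_in, W_out, x):
    """Two-layer MLP with a ReLU in between."""
    hidden = [max(0.0, h) for h in matvec(W_in, x)]
    return matvec(W_out, hidden)

# "Full" layer: 2 inputs -> 4 hidden units -> 2 outputs.
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W_out = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]

# "Small" layer: carved out of the same weights, nothing extra stored.
small_in, small_out = slice_mlp(W_in, W_out, 2)

x = [2.0, 3.0]
full_y = mlp(W_in, W_out, x)
small_y = mlp(small_in, small_out, x)
```

The point is just that the small variant is a view into the big one, which is why the whole package can be the size of the largest model.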
However, the second one is the most interesting, especially in our resource-limited world. If we can run a single model in a hybrid mode, we may see significant impacts on real-world performance in a variety of applications. From the model card:
⚠️ Note on inference support. Elastic budget control is not yet supported in the standard vLLM inference engine — switching nested sub-models within a single generation (e.g., 23B → 30B think → answer) currently requires a custom inference path. Nested models preserve the Mamba and attention layer structure, enabling cache-state transplantation between models, and efficient native vLLM integration is actively being worked on.
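As I read that note, the scheduling decision itself is the easy part. A purely hypothetical sketch of the 23B think → 30B answer switch is below. This is not vLLM’s API and not NVIDIA’s implementation; the `</think>` marker, the variant labels, and the function names are all my assumptions. It also ignores the hard part the note calls out, which is carrying the Mamba and attention cache state across the switch.

```python
# Hypothetical sketch: not vLLM's API and not NVIDIA's implementation.
THINK_END = "</think>"  # assumed marker for the end of the reasoning phase

def pick_variant(generated_tokens, think_variant="23B", answer_variant="30B"):
    """Return which nested sub-model should produce the next token."""
    if THINK_END in generated_tokens:
        return answer_variant  # reasoning is finished, use the larger model
    return think_variant       # still thinking, use the cheaper model

def plan(tokens):
    """For a finished sequence, list the variant that generated each token."""
    return [pick_variant(tokens[:i]) for i in range(len(tokens))]

tokens = ["<think>", "step", "</think>", "final", "answer"]
schedule = plan(tokens)
```

Here the `</think>` token itself is still produced by the thinking variant, and everything after it by the answer variant.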
Looking forward to seeing this supported in vLLM. For those who know far more than I do, what might that take?