The Situation
I’m currently pushing the limits of Nemotron-3-Nano 30B on a new 4x RTX PRO 6000 Blackwell (SM120) workstation setup. My goal is reliable 400K+ context retrieval, but I’ve hit a surprising roadblock: NVIDIA’s official vLLM containers are significantly underperforming the community build, and in some cases, failing entirely.
I’m reaching out to both the NVIDIA team and the community to see if anyone else on the SM120 architecture (Blackwell Workstation/Consumer) has found a way to bridge this gap.
System Environment
- GPUs: 4x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB GDDR7 each | 384GB Total)
- Compute Capability: SM120
- Driver: 580.105.08
- OS: Fedora 43 | CPU: AMD Ryzen Threadripper PRO 9965WX | RAM: 2TB
- Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16
The Performance Gap
While I can hit 400K tokens successfully on the community build, the official NVIDIA 25.12 container consistently fails once I pass the 370K mark. (A sketch of the retrieval probe behind these pass/fail results follows the table.)
| Context | Community vLLM (0.14.0rc1) | NVIDIA Container (25.12) |
|---|---|---|
| 200K – 370K | PASS | PASS |
| 380K | Not tested | FAIL (Partial/Hallucinated) |
| 400K | PASS (Native vLLM + FlashInfer) | FAIL (Empty response) |
| 450K | FAIL | FAIL |
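For context, all pass/fail calls above come from a simple needle-in-a-haystack style probe against vLLM's OpenAI-compatible endpoint. The sketch below is my own test harness, not an official benchmark: the filler text, needle, line counts, port, and served model name are assumptions about my setup.

```bash
# Minimal sketch of the retrieval probe (sizes are illustrative; ~40K filler lines is
# roughly 400K tokens with this sentence -- adjust the line count per target context).
yes "The quick brown fox jumps over the lazy dog." | head -n 40000 > /tmp/haystack.txt
sed -i '20000i The secret code word is BLUEBERRY-7421.' /tmp/haystack.txt

# Ask the model to retrieve the needle via the OpenAI-compatible chat endpoint.
jq -n --rawfile doc /tmp/haystack.txt '{
  model: "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
  messages: [{role: "user", content: ($doc + "\n\nWhat is the secret code word?")}],
  max_tokens: 64
}' | curl -s http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" -d @- | jq -r '.choices[0].message.content'
# PASS = the reply contains BLUEBERRY-7421; FAIL = empty, truncated, or hallucinated output.
```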
Detailed Testing Log (NVIDIA Container 25.12-py3)
I’ve spent the last few days systematically trying to fix the retrieval issues in the 25.12 container. None of these attempts resolved the 400K failure:
| Attempt | Configuration Change | Result |
|---|---|---|
| 1 | Default settings (-p 8000:8000) | Connection refused (port mapping issue) |
| 2 | Default + --network=host | FAIL: 400K empty response |
| 3 | VLLM_ATTENTION_BACKEND=FLASHINFER | FAIL: performance caps at ~370K |
| 4 | VLLM_FLASH_ATTN_VERSION=2 | FAIL: 400K retrieval failure |
| 5 | VLLM_FLASHINFER_MOE_BACKEND=throughput | FAIL: 400K retrieval failure |
| 6 | VLLM_USE_FLASHINFER_MOE_FP8=1 | FAIL: 400K retrieval failure |
| 7 | All above + --hf-overrides max_position_embeddings | FAIL: 400K retrieval failure |
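For completeness, attempts 2 through 7 were all variations on a launch along these lines. The image name below is a placeholder for the NGC image I pulled; the cache mount, tensor-parallel size, and context length are my settings, and the serve entrypoint may differ between container releases. The env vars were swapped in per the table above (attempt 2 used none of them).

```bash
# Sketch of the launch used for attempts 2-7 (placeholder image name; my flag values).
docker run --rm --gpus all --network=host --ipc=host \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -e VLLM_ATTENTION_BACKEND=FLASHINFER \
  -e VLLM_FLASHINFER_MOE_BACKEND=throughput \
  <nvidia-vllm-ngc-image>:25.12-py3 \
  vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
    --tensor-parallel-size 4 \
    --max-model-len 400000 \
    --trust-remote-code
```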
Earlier Session Notes: I also tried NCCL tuning (NCCL_ALGO=Ring, NCCL_PROTO=Simple), but this was catastrophic—it reduced functional context to 0K.
Critical Breakages & Observations
1. Container 25.09-py3 is DOA
The 25.09 container crashes immediately on model load with:
AttributeError: 'NemotronHConfig' object has no attribute 'rms_norm_eps'
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/nemotron_h.py", line 170
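I haven't found a clean fix for this. One idea I have not yet tested (and it is purely an assumption that the missing attribute can be injected through the HF config) is passing rms_norm_eps via --hf-overrides, with the value taken from the checkpoint's own config.json rather than guessed:

```bash
# Untested workaround idea for the 25.09 crash: inject the missing attribute via
# --hf-overrides. The value must come from the model's config.json; 1e-5 is a placeholder.
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --hf-overrides '{"rms_norm_eps": 1e-5}' \
  --trust-remote-code
```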
2. SM120 vs SM100
I suspect the official containers are highly optimized for Blackwell Server (SM100/101), but those optimizations aren’t translating to the SM120 architecture used in Workstation/Consumer Blackwell cards.
3. Missing MoE Kernels
The logs explicitly mention missing optimized MoE kernel configs for this specific hardware:
Using default MoE config. Performance might be sub-optimal! Config file not found at [.../E=128,N=464,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Workstation_Edition.json]
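As far as I can tell, these are the per-device Triton tuning files vLLM ships under its fused_moe/configs/ directory, keyed by batch size. The sketch below shows how I'd check what exists inside the container and, as a stopgap, seed a file for this card from a similar device; the directory lookup is the usual vLLM install layout, the source device name is a placeholder, and borrowed tunings are obviously not guaranteed to be optimal.

```bash
# Locate the shipped fused-MoE tuning configs (usual vLLM install layout; adjust if needed).
CFG_DIR=$(python3 -c "import vllm, os; print(os.path.join(os.path.dirname(vllm.__file__), 'model_executor/layers/fused_moe/configs'))")
ls "$CFG_DIR" | grep 'E=128,N=464' || echo "no tuned config for this expert/N shape"

# Stopgap (assumption, not a recommendation): copy a config tuned for another device to the
# filename vLLM expects for this card, then re-benchmark before trusting the numbers.
cp "$CFG_DIR/E=128,N=464,device_name=<similar_device>.json" \
   "$CFG_DIR/E=128,N=464,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Workstation_Edition.json"
```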
Appeal to the Community & NVIDIA
To the Community:
- Is anyone else running Blackwell Workstation (RTX PRO 6000) or Consumer (RTX 50-series) cards? Have you noticed a similar context ceiling?
- Has anyone successfully generated or located the missing .json MoE kernel configs for SM120?
- Are there any specific FlashInfer or NCCL flags you've found that stabilize retrieval beyond 400K?
To NVIDIA:
- Are there plans to align the container vLLM version (currently 0.11.1) with the community build (0.14.0rc1)?
- Can you clarify whether SM120 is receiving the same optimization attention as SM100 in the NGC containers?
- Can we get a recommended reference configuration for Nemotron-3-Nano on Blackwell Workstation hardware?
Current Status: I’m sticking with Community vLLM 0.14.0rc1 + CUDA 12.8 as it’s the only way to get 400K context. This feels backwards given that I’m using NVIDIA hardware and an NVIDIA-authored model.
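For anyone who wants to reproduce the working setup, this is roughly how I'm launching it. The flag values (tensor parallelism, context length) are my configuration, not anything NVIDIA has blessed, and the FlashInfer package name reflects my environment.

```bash
# Rough sketch of the working community setup (CUDA 12.8 wheels in my environment).
pip install vllm==0.14.0rc1 flashinfer-python

VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --tensor-parallel-size 4 \
  --max-model-len 400000 \
  --trust-remote-code
```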
Testing Date: January 5, 2026