Nemotron-3-Nano 30B long-context retrieval fails on 4x RTX PRO 6000 (SM120): NVIDIA vLLM containers perform worse than community vLLM

The Situation

I’m currently pushing the limits of Nemotron-3-Nano 30B on a new 4x RTX PRO 6000 Blackwell (SM120) workstation setup. My goal is reliable 400K+ context retrieval, but I’ve hit a surprising roadblock: NVIDIA’s official vLLM containers are significantly underperforming the community build, and in some cases, failing entirely.

I’m reaching out to both the NVIDIA team and the community to see if anyone else on the SM120 architecture (Blackwell Workstation/Consumer) has found a way to bridge this gap.

System Environment

  • GPUs: 4x NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96GB GDDR7 each | 384GB Total)

  • Compute Capability: SM120

  • Driver: 580.105.08

  • OS: Fedora 43 | CPU: AMD Ryzen Threadripper PRO 9965WX | RAM: 2TB

  • Model: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16


The Performance Gap

While I can hit 400K tokens successfully on the community build, the official NVIDIA 25.12 container consistently fails once I pass the 370K mark.

Context     | Community vLLM (0.14.0rc1)      | NVIDIA Container (25.12)
200K–370K   | PASS                            | PASS
380K        | Not tested                      | FAIL (Partial/Hallucinated)
400K        | PASS (Native vLLM + FlashInfer) | FAIL (Empty response)
450K        | FAIL                            | FAIL

Detailed Testing Log (NVIDIA Container 25.12-py3)

I’ve spent the last few days systematically trying to fix the retrieval issues in the 25.12 container. None of these attempts resolved the 400K failure:

Attempt | Configuration Change                                | Result
1       | Default settings (-p 8000:8000)                     | Connection refused (port mapping issue)
2       | Default + --network=host                            | FAIL: 400K empty response
3       | VLLM_ATTENTION_BACKEND=FLASHINFER                   | FAIL: performance caps at ~370K
4       | VLLM_FLASH_ATTN_VERSION=2                           | FAIL: 400K retrieval failure
5       | VLLM_FLASHINFER_MOE_BACKEND=throughput              | FAIL: 400K retrieval failure
6       | VLLM_USE_FLASHINFER_MOE_FP8=1                       | FAIL: 400K retrieval failure
7       | All above + --hf-overrides max_position_embeddings  | FAIL: 400K retrieval failure

Earlier Session Notes: I also tried NCCL tuning (NCCL_ALGO=Ring, NCCL_PROTO=Simple), but this was catastrophic, reducing the functional context to 0K.


Critical Breakages & Observations

1. Container 25.09-py3 is DOA

The 25.09 container crashes immediately on model load with:

AttributeError: 'NemotronHConfig' object has no attribute 'rms_norm_eps'

File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/nemotron_h.py", line 170
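
For anyone trying to reproduce this, a quick check inside the 25.09 container confirms whether the bundled transformers build exposes the attribute vLLM expects. This is a diagnostic sketch only, not a fix:

from transformers import AutoConfig

# Diagnostic: verify whether this environment's NemotronHConfig exposes
# rms_norm_eps (False would match the AttributeError above).
cfg = AutoConfig.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    trust_remote_code=True,
)
print(type(cfg).__name__)            # expect NemotronHConfig
print(hasattr(cfg, "rms_norm_eps"))  # False => model load will crash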

2. SM120 vs SM100

I suspect the official containers are highly optimized for Blackwell Server (SM100/101), but those optimizations aren’t translating to the SM120 architecture used in Workstation/Consumer Blackwell cards.

3. Missing MoE Kernels

The logs explicitly mention missing optimized MoE kernel configs for this specific hardware:

Using default MoE config. Performance might be sub-optimal! Config file not found at [.../E=128,N=464,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Workstation_Edition.json]
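
For anyone digging into this, here is a minimal sketch that rebuilds the filename from the warning and checks whether a tuned config ships with the install. The configs directory location is my assumption from the path in the log and may differ across vLLM versions:

import os
import torch
import vllm.model_executor.layers.fused_moe as fused_moe

# Reconstruct the config filename from the warning and look for it in the
# fused MoE configs directory bundled with this vLLM install.
device_name = torch.cuda.get_device_name(0).replace(" ", "_")
fname = f"E=128,N=464,device_name={device_name}.json"
path = os.path.join(os.path.dirname(fused_moe.__file__), "configs", fname)
print(path, "->", os.path.exists(path))

If the file is absent, vLLM's benchmarks/kernels/benchmark_moe.py tuning script should be able to generate one to drop into that directory, though I have not tested it on SM120.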

Appeal to the Community & NVIDIA

To the Community:

  • Is anyone else running Blackwell Workstation(s) (RTX PRO 6000) or Consumer (RTX 50-series) cards? Have you noticed a similar context ceiling?

  • Has anyone successfully generated or located the missing .json MoE kernel configs for SM120?

  • Are there any specific FlashInfer or NCCL flags you’ve found that stabilize retrieval beyond 400K?

To NVIDIA:

  • Are there plans to align the container vLLM version (currently 0.11.1) with the community (0.14.0rc1)?

  • Can you clarify if SM120 is receiving the same optimization attention as SM100 in the NGC containers?

  • Can we get a recommended reference configuration for Nemotron-3-Nano on Blackwell Workstation hardware?

Current Status: I’m sticking with Community vLLM 0.14.0rc1 + CUDA 12.8 as it’s the only way to get 400K context. This feels backwards given that I’m using NVIDIA hardware and an NVIDIA-authored model.

Testing Date: January 5, 2026

ADDITIONAL DIAGNOSTIC INFORMATION

================================================================================
COMPLETE SOFTWARE VERSIONS

Native vLLM Environment (Working - 400K context):
vLLM: 0.14.0rc1.dev221+g97a01308e.cu128 (built from source)
PyTorch: 2.9.1+cu128
FlashInfer: 0.5.3
NCCL: 2.27.5
Triton: 3.5.1
cuDNN: 9.10.2.21
CUDA Toolkit: 12.8.61
Python: 3.11.11

NVIDIA Container 25.12-py3:
Image Digest: sha256:02ae0d001d8f301b5e10ddb…
Created: 2025-12-17
vLLM: 0.11.1
CUDA: 13.1

NVIDIA Container 25.09-py3:
Image Digest: sha256:f1bc0ef9676a…
vLLM: 0.10.1
CUDA: 13.0

================================================================================
FULL SYSTEM SPECIFICATIONS

OS: Fedora Linux 43 (Workstation Edition)
Kernel: 6.17.9-300.fc43.x86_64
CPU: AMD Ryzen Threadripper PRO 9965WX 24-Cores (48 threads)
RAM: 2.0 TiB
GPU Driver: 580.105.08 (NVIDIA UNIX Open Kernel Module)
Driver CUDA: 13.0
Toolkit CUDA: 12.8.61

GPU Details (x4):
Model: NVIDIA RTX PRO 6000 Blackwell Workstation Edition
VRAM: 97,887 MiB each (≈382 GiB total)
Compute: SM 12.0 (SM120)
Power: 450W each
SM Clock: 3090 MHz max
Mem Clock: 14001 MHz max

================================================================================
MODEL CONFIGURATION

{
  "model_type": "nemotron_h",
  "architectures": ["NemotronHForCausalLM"],
  "max_position_embeddings": 262144,
  "hidden_size": 2688,
  "num_hidden_layers": 52,
  "num_attention_heads": 32,
  "num_key_value_heads": 2,
  "vocab_size": 131072,
  "torch_dtype": "bfloat16"
}

Note: Model’s native max_position_embeddings is 262144 (256K). We override to 1048576 (1M) using --hf-overrides.

================================================================================
EXACT NATIVE LAUNCH COMMAND (WORKING)

source ~/vllm-nemotron-env/bin/activate

VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --tensor-parallel-size 4 \
  --max-model-len 1048576 \
  --hf-overrides '{"max_position_embeddings": 1048576}' \
  --mamba-ssm-cache-dtype float32 \
  --gpu-memory-utilization 0.95 \
  --enforce-eager \
  --trust-remote-code
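
A quick way to confirm the override actually took effect, assuming your vLLM build reports max_model_len via /v1/models (if it does not, check the server startup log instead):

import requests

# Confirm the served context window matches the 1M override.
models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
for m in models["data"]:
    print(m["id"], "max_model_len =", m.get("max_model_len"))  # expect 1048576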

================================================================================
NEEDLE-IN-HAYSTACK TEST SCRIPT

import requests, random, string, time

def test_needle(target_tokens):
    # Build a random needle and a filler haystack (each filler sentence is
    # roughly 11 tokens, hence target_tokens / 11 repetitions).
    needle = "NK_" + ''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
    filler = "The quick brown fox jumps over the lazy dog. "
    num_fillers = int(target_tokens / 11)
    haystack = "CONTEXT_START " + (filler * num_fillers) + " CONTEXT_END"
    insert_pos = int(len(haystack) * 0.5)
    haystack = haystack[:insert_pos] + f" [SECRET_KEY: {needle}] " + haystack[insert_pos:]

    start = time.time()
    response = requests.post("http://localhost:8000/v1/chat/completions",
        json={"model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
              "messages": [{"role": "system", "content": "Return ONLY the requested value."},
                           {"role": "user", "content": f"{haystack}\n\nWhat is [SECRET_KEY: ...]? Return ONLY the value:"}],
              "max_tokens": 100, "temperature": 0.0}, timeout=300)
    content = response.json()["choices"][0]["message"]["content"]
    if "</think>" in content:  # strip any thinking-mode preamble
        content = content.split("</think>")[-1].strip()
    passed = needle in content
    print(f"Needle: {needle} | Response: {content[:50]} | {time.time()-start:.1f}s | {'PASS' if passed else 'FAIL'}")
    return passed

# Test at a specific context length
test_needle(400000)
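
To locate the ceiling more precisely, a simple sweep can be bolted onto test_needle() above. Lengths are approximate, since the filler-to-token ratio is an estimate:

# Sweep upward from a known-good length to find the highest passing context.
def find_ceiling(low=370_000, high=450_000, step=10_000):
    last_pass = None
    for tokens in range(low, high + 1, step):
        if not test_needle(tokens):
            break
        last_pass = tokens
    return last_pass

print("Highest passing context:", find_ceiling())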

================================================================================
CRITICAL: NCCL TUNING BREAKS RETRIEVAL

We tested NCCL performance tuning and found it BREAKS long context retrieval:

DEFAULT NCCL SETTINGS: 400K = PASS ✓
NCCL_ALGO=Ring: 400K = FAIL ✗
NCCL_PROTO=Simple: 400K = FAIL ✗
NCCL_ALGO=Ring + Simple: 400K = FAIL ✗

Do not use NCCL tuning for long context on SM120.
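
A cheap guard before launching (my addition, not part of the original test runs) is to make sure no tuning variables are still exported in the shell that starts vLLM:

import os

# Flag any lingering NCCL tuning variables before starting vLLM.
leftovers = {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}
for k, v in sorted(leftovers.items()):
    print(f"WARNING: {k}={v} is set; defaults passed 400K, tuned values failed")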

================================================================================
GPU MEMORY UTILIZATION

At 400K context (native vLLM):
GPU 0: 94,002 MiB / 97,887 MiB (96.0%)
GPU 1: 95,381 MiB / 97,887 MiB (97.4%)
GPU 2: ~95,000 MiB / 97,887 MiB
GPU 3: ~95,000 MiB / 97,887 MiB

Memory is near saturation at 400K, explaining why 450K fails.
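
For scale, here is a back-of-envelope estimate of the attention KV cache alone, treating all 52 layers as attention layers (an overestimate, since Nemotron-H is a hybrid Mamba/attention design) and taking head_dim as hidden_size / num_attention_heads (the config may override this):

# KV cache upper bound at 400K tokens, BF16, from the config values above.
layers, kv_heads, hidden, heads = 52, 2, 2688, 32
head_dim = hidden // heads                                  # 84 under the assumption above
kv_bytes = 2 * layers * kv_heads * head_dim * 2 * 400_000   # K+V, 2 bytes/elem
print(f"{kv_bytes / 2**30:.1f} GiB total, "
      f"{kv_bytes / 4 / 2**30:.1f} GiB per GPU at TP=4")

That works out to roughly 26 GiB total; the rest of the reported usage is weights, the float32 Mamba SSM cache, and vLLM's preallocated pool (gpu-memory-utilization 0.95).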

================================================================================
RELEVANT PYTHON PACKAGES

cuda-python 13.1.1
flashinfer-python 0.5.3
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cudnn-cu12 9.10.2.21
nvidia-nccl-cu12 2.27.3
torch 2.9.1
triton 3.5.1
vllm 0.14.0rc1.dev221+g97a01308e.cu128

================================================================================

Hi mcole94,

I do not have experience with RTX 6K’s, so I reached out to the team.

They recommend:

  • Using 25.12.post1 over 25.12. The former ships vLLM 0.12.0 versus the latter's 0.11.1; this might be related to a bug the team encountered and corrected in 25.12.post1.

  • Trying FP8 weights + an FP8 KV cache, which might help with 1M.

Let me know how that goes for you, and I will continue to check in with the team for any updates.

Thanks,

Aharpster

Did you tell them we are running 4x RTX PRO Blackwells?

Hi @Aharpster,

Thank you for the suggestions. We tested both the 25.12.post1 container and FP8 weights with FP8 KV cache - same behavior at 450K+.

What worked for us: disabling thinking mode.

--default-chat-template-kwargs '{"enable_thinking": false}'

With this flag, we’re getting 100% retrieval accuracy up to 1M tokens on our 4x RTX PRO 6000 setup.
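
For anyone who wants to A/B this without restarting the server, recent vLLM builds also accept chat_template_kwargs per request (worth verifying on your version):

import requests

# Disable thinking mode for a single request instead of server-wide.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
        "messages": [{"role": "user", "content": "ping"}],
        "chat_template_kwargs": {"enable_thinking": False},
        "max_tokens": 20,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])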

Thank you for your help and prompt response.
