Jetson AGX Thor + vLLM (26.02): MoE performance significantly below reference — missing fused MoE config?

Hi all,

I’ve been working through benchmarking LLM inference on Jetson AGX Thor and wanted to share results and ask for guidance, particularly around MoE workloads.


Environment

  • Jetson AGX Thor Dev Kit

  • Ubuntu 24.04.4 LTS

  • Kernel: 6.8.12-tegra

  • CUDA: 13.0

  • Driver: 580

  • Container: nvcr.io/nvidia/vllm:26.02-py3

  • vLLM: 0.15.1

  • Power mode: MAXN

  • Clocks locked (jetson_clocks)


Benchmark Setup

  • Input: ~2048 tokens

  • Output: ~128 tokens

  • Dataset: random synthetic prompts

  • Modes:

    • C1 (single request)

    • C8 (8 concurrent requests)

Each test was repeated until results stabilized (run-to-run variance within ~5%).
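For anyone reproducing this, the stop criterion can be sketched as a small helper. This is a hypothetical illustration (a simple max-min spread vs. mean check), not part of any benchmark harness:

```python
def is_stable(samples, tol=0.05):
    """Return True when run-to-run spread is within tol of the mean.

    `samples` are sustained tok/s values from consecutive benchmark runs.
    """
    if len(samples) < 2:
        return False
    mean = sum(samples) / len(samples)
    spread = max(samples) - min(samples)
    return spread <= tol * mean

# Example: two consecutive C8 runs from the results below
print(is_stable([266.4, 265.8]))  # → True (within 5%)
print(is_stable([150.4, 266.4]))  # → False (cold-start run still included)
```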


Results (stable runs)

Llama 3.1 8B (Dense)

  • C1: ~45 tok/s

  • C8: ~270 tok/s

This is consistent with or better than published Thor references.


Qwen 3 30B-A3B (MoE)

  • C1: ~34 tok/s (vs ~61 expected)

  • C8: ~96 tok/s (vs ~226 expected)


Mixtral-8x7B (MoE)

  • C1: ~7 tok/s

  • C8: ~14 tok/s

Performance remained consistently low across repeated runs.


Observations

  • Warmup alone was not sufficient — multiple runs were required for stabilization

  • Dense models scale well under concurrency

  • MoE models show reduced throughput and higher latency, especially under concurrency


Relevant Log Output

From server logs:

Using default MoE config. Performance might be sub-optimal!
Config file not found at:
.../fused_moe/configs/E=128,N=768,device_name=NVIDIA_Thor.json
Not enough SMs to use max_autotune_gemm mode
Using TRITON backend for Unquantized MoE
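For context, the missing file name in that warning is derived from the MoE shape (number of experts E, intermediate size N) and the CUDA device name. A rough sketch of the lookup, as an illustrative reconstruction rather than vLLM's exact code (real config names can also encode dtype and other fields):

```python
# Illustrative reconstruction of how the fused-MoE config filename
# in the log above is formed. Not vLLM's actual implementation.
def moe_config_filename(num_experts: int, intermediate_size: int,
                        device_name: str) -> str:
    device = device_name.replace(" ", "_")
    return f"E={num_experts},N={intermediate_size},device_name={device}.json"

print(moe_config_filename(128, 768, "NVIDIA Thor"))
# → E=128,N=768,device_name=NVIDIA_Thor.json
```

Since no such file ships for Thor, the Triton fused-MoE kernels fall back to a generic default tile configuration, which matches the "Performance might be sub-optimal!" warning.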

Question / Request for Guidance

Based on the above, I wanted to ask:

  1. Are Thor-specific fused MoE configuration files expected to be included in the vLLM container (26.02 or later)?

  2. Is Triton currently the intended backend for MoE workloads on Thor?

  3. Are there recommended runtime flags or tuning parameters for improving MoE performance on this platform?

  4. Are there known limitations or best practices when running MoE models on Thor vs Dense models?


Notes

I want to be careful not to draw incorrect conclusions — the intent here is to understand whether the observed MoE behavior is expected given the current software stack, or if there are configuration steps I may be missing.

Happy to provide additional logs or full benchmark details if helpful.


Thanks in advance for any guidance.

thor_vllm_benchmark_playbook_fixed_v3_with_results_add.docx (207.9 KB)

benchresult.txt (1.5 KB)

Hi,

Could you share the benchmark data you are referring to with us first?

In our Jetson AI Lab table, we don't have Mixtral-8x7B results, but there are entries for Qwen3.5 35B-A3B (MoE) and Qwen3 30B-A3B (MoE).

1.
We suppose not, but you can find the exact command we use for benchmarking in the link below:

For example, the command we use for Qwen MoE is:

$ sudo docker run -it --rm --pull always --runtime=nvidia --network host ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor vllm serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 --gpu-memory-utilization 0.8 --enable-prefix-caching --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder

2.
Usually, we use the vLLM container directly.

3.
Please check the above link to find the exact flags we use for each model.

4.
Please try running the model with the command shared in the above link.
For example, the perf results of the Qwen3 MoE model are:

  • Qwen3 30B-A3B (c=1): 67 tok/s
  • Qwen3 30B-A3B (c=8): 242 tok/s

This should match the expected performance.

Thanks.

Hello AastaLLL

Check the two attachments in the original thread.

I see a lot of useful info to try a rerun with:
the Jetson-specific container ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor and the options used in the test.

Regards

================================================================================================================================
  Results vs NVIDIA Published
================================================================================================================================
  Model        Run    C1 Sustained     NVIDIA C1    Delta  C8 Sustained     NVIDIA C8    Delta  C1 TTFT ms
  ------------  ------  ----------------  ------------  ------  ----------------  ------------  ------  ------------
  llama8       R1     33.92 tok/s      44.00        -22.9% 150.39 tok/s     244.00       -38.4% 756.23
  llama8       R2     45.19 tok/s      44.00        +2.7%  266.41 tok/s     244.00       +9.2%  39.58
  llama8       R3     45.33 tok/s      44.00        +3.0%  265.81 tok/s     244.00       +8.9%  39.54
  qwen30       R1     58.57 tok/s      67.00        -12.6% 176.98 tok/s     242.00       -26.9% 477.79
  qwen30       R2     78.63 tok/s      67.00        +17.4% 260.57 tok/s     242.00       +7.7%  42.20
  qwen30       R3     81.38 tok/s      67.00        +21.5% 262.26 tok/s     242.00       +8.4%  41.93
  qwen32       R1     10.11 tok/s      13.19        -23.4% 42.27 tok/s      79.10        -46.6% 2994.17
  qwen32       R2     13.13 tok/s      13.19        -0.5%  85.61 tok/s      79.10        +8.2%  103.92
  qwen32       R3     13.12 tok/s      13.19        -0.5%  85.78 tok/s      79.10        +8.4%  103.46
  qwen35       R1     28.49 tok/s      35.00        -18.6% 128.05 tok/s     125.00       +2.4%  471.93
  qwen35       R2     30.07 tok/s      35.00        -14.1% 151.99 tok/s     125.00       +21.6% 236.65
  qwen35       R3     30.06 tok/s      35.00        -14.1% 151.99 tok/s     125.00       +21.6% 236.71
  llama70      R1     4.87 tok/s       6.27         -22.3% 20.67 tok/s      41.50        -50.2% 6418.73
  llama70      R2     6.39 tok/s       6.27         +1.9%  44.49 tok/s      41.50        +7.2%  184.44
  llama70      R3     6.39 tok/s       6.27         +1.9%  44.52 tok/s      41.50        +7.3%  183.51
  gptoss120    R1     20.12 tok/s      ?.??         n/a    73.77 tok/s      ?.??         n/a    907.50
  gptoss120    R2     29.86 tok/s      ?.??         n/a    105.29 tok/s     ?.??         n/a    64.12
  gptoss120    R3     30.95 tok/s      ?.??         n/a    97.84 tok/s      ?.??         n/a    62.50
================================================================================================================================

Note: Qwen3.5 35B-A3B is a Mixture-of-Experts (MoE) model. Speculative Decoding with MTP does not work.

Summary

Multi-Token Prediction (MTP) speculative decoding fails to load on the NVFP4-quantized Qwen3.5-35B-A3B checkpoint published at Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 when using the official NVIDIA Jetson Thor vLLM container (ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor). The failure is caused by a weight namespace mismatch in the checkpoint's safetensors file: all weights are prefixed with 'language_model.' but vLLM's Qwen3_5MoeMTP model loader expects them without this prefix.

Additionally, the checkpoint appears to contain Mamba/SSM architecture weights (linear_attn, A_log, conv1d, dt_bias) rather than standard Qwen3.5 MoE attention weights, suggesting the safetensors file may be from an entirely different model.

This is particularly impactful because NVIDIA’s own Jetson AI Lab page ( Qwen3.5 35B-A3B (MoE) | Jetson AI Lab ) explicitly advertises MTP speculative decoding as a feature for this exact model/container combination, yet it cannot be made to work with the published checkpoint.

Hi,

Which command do you use?
Do you try the one shared in the tutorial?

sudo docker run -it --rm --pull always --runtime=nvidia --network host ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor vllm serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 --gpu-memory-utilization 0.8 --enable-prefix-caching --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder

For the failure, do you have the error message or log that you can share with us?

Thanks.

Hi AastaLLL,

Thank you for the guidance. I followed your recommendation to switch to the Jetson Thor container (ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor) and repeated the full benchmark suite using a 3-run stabilization protocol (warmup + 3 runs at C1 and C8, ISL/OSL 2048/128). The results improved dramatically over my original NGC container runs.


VALIDATED RESULTS (R3 — stable run, Jetson Thor container)

Qwen3 30B-A3B (W4A16)

  • C1: 81.38 tok/s (+21.5% vs your reference of 67 tok/s)
  • C8: 262.26 tok/s (+8.4% vs your reference of 242 tok/s)

Llama 3.1 8B (W4A16)

  • C1: 45.33 tok/s (+3.0% vs reference of 44 tok/s)
  • C8: 265.81 tok/s (+8.9% vs reference of 244 tok/s)

Qwen3 32B (W4A16)

  • C1: 13.12 tok/s (-0.5% vs reference of 13.19 tok/s)
  • C8: 85.78 tok/s (+8.4% vs reference of 79.1 tok/s)

GPT OSS 120B (NVFP4, Thor-exclusive)

  • C1: 30.95 tok/s (no published reference found)
  • C8: 97.84 tok/s (no published reference found)

Key finding on benchmark methodology: R1 (the first run after warmup) consistently underperforms by 12-23% due to cold-start effects (empty KV/prefix cache, GPU not yet at steady state). R2 and R3 stabilize and match or exceed published references. Single-run benchmarks on this platform will produce misleading results regardless of which container is used.
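The "discard R1, trust R2/R3" protocol can be expressed as a tiny sketch (hypothetical helper names; this is not part of any benchmark tool):

```python
def sustained_throughput(runs):
    """Report throughput with the cold-start run (R1) discarded.

    `runs` is a list of tok/s values in run order; R1 is treated as
    an extended warmup and excluded from the reported average.
    """
    if len(runs) < 3:
        raise ValueError("protocol expects at least 3 runs")
    stable = runs[1:]  # drop R1 (cold-start)
    return sum(stable) / len(stable)

# Qwen3 30B-A3B C8 runs from the table above
print(sustained_throughput([176.98, 260.57, 262.26]))  # average of R2 and R3
```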


QWEN3.5 35B-A3B NVFP4 CHECKPOINT ISSUE

During this process I identified two separate problems with the Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 checkpoint.

Issue 1 — Basic serve command runs but does not achieve published performance

The Jetson AI Lab command without MTP flags loads and serves without crashing:

vllm serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 \
  --gpu-memory-utilization 0.8 --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder

However the performance does not match your published reference of 35 tok/s C1 / 125 tok/s C8.

Our R3 results with this checkpoint and command:

  • C1: 30.06 tok/s (-14.1% vs reference)
  • C8: 151.99 tok/s (+21.6% vs reference)

Server logs confirm bf16 fallback is occurring rather than true NVFP4 quantized inference. This is consistent with the weight namespace issue described below preventing correct NVFP4 loading.

Issue 2 — MTP speculative decoding causes hard crash

Adding the MTP flag as advertised on the Jetson AI Lab page:

--speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'

Causes a hard crash:

ValueError: There is no module or parameter named 'language_model'
in Qwen3_5MoeMTP.

Root cause: Inspection of the checkpoint’s safetensors file shows all 123,973 weight keys are prefixed with ‘language_model.’ but vLLM’s Qwen3_5MoeMTP loader expects them under the flat ‘model.’ namespace. The remap_weight_names() function in qwen3_5_mtp.py does not handle this prefix.

Additionally the checkpoint contains Mamba/SSM architecture weights (linear_attn, A_log, conv1d, dt_bias) that are inconsistent with the Qwen3.5 MoE attention architecture — suggesting the safetensors file may be from a different model entirely.
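Both checkpoint symptoms can be illustrated with a small key-inspection sketch. The helper names here are hypothetical; with a local checkpoint, the actual key list would come from `safetensors.safe_open(path, framework="pt").keys()`:

```python
# Hypothetical diagnostic for the two checkpoint issues described above.
MAMBA_MARKERS = ("linear_attn", "A_log", "conv1d", "dt_bias")

def strip_prefix(key: str) -> str:
    """Remap a 'language_model.'-prefixed key to the flat namespace."""
    return key.removeprefix("language_model.")

def diagnose_keys(keys):
    """Count keys with the unexpected prefix and Mamba/SSM-style names."""
    prefixed = sum(k.startswith("language_model.") for k in keys)
    mamba_like = sum(any(m in k for m in MAMBA_MARKERS) for k in keys)
    return {"prefixed": prefixed, "mamba_like": mamba_like}

sample = [
    "language_model.model.layers.0.linear_attn.A_log",
    "language_model.model.layers.0.mlp.experts.0.w1.weight",
]
print(diagnose_keys(sample))    # → {'prefixed': 2, 'mamba_like': 1}
print(strip_prefix(sample[1]))  # → model.layers.0.mlp.experts.0.w1.weight
```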

The Jetson AI Lab page explicitly advertises MTP speculative decoding for this exact checkpoint and container combination, but it cannot be made to work with the published weights.

Workaround: The unquantized base model (Qwen/Qwen3.5-35B-A3B) serves successfully and MTP loads without crashing. This trades NVFP4 quantization speed for a working MTP configuration. Our R3 results with the base model:

  • C1: 30.06 tok/s (bf16 fallback, expected to improve with correct NVFP4 checkpoint)
  • C8: 151.99 tok/s (+21.6% vs reference)

I have prepared a detailed bug report documenting the weight key structure, full traceback, and verification commands. Happy to share the full report or attach it here if useful to the NVIDIA team. I am also prepared to file issues at the Kbenkhaled HuggingFace repository and the jetson-containers GitHub if that would help get this resolved.

Thank you again for pointing me to the correct container path — the performance difference was dramatic and the methodology findings should be useful to other Thor users.

WayNo

qwen35_mtp_bug_report.docx (11.3 KB)

Hi,

Thanks for reporting this.

We are checking this issue with our internal team.
Will get back to you later.

Thanks.

Hi,

Thanks for reporting this.
Confirmed that we can see the same error when using Qwen3.5-35B-A3B-NVFP4 + MTP:

ValueError: There is no module or parameter named 'language_model' in Qwen3_5MoeMTP. 

We are now checking with the internal team for more information.
Will get back to you later.

Thanks.
