Jetson AGX Thor + vLLM (26.02): MoE performance significantly below reference — missing fused MoE config?

Hi all,

I’ve been working through benchmarking LLM inference on Jetson AGX Thor and wanted to share results and ask for guidance, particularly around MoE workloads.


Environment

  • Jetson AGX Thor Dev Kit

  • Ubuntu 24.04.4 LTS

  • Kernel: 6.8.12-tegra

  • CUDA: 13.0

  • Driver: 580

  • Container: nvcr.io/nvidia/vllm:26.02-py3

  • vLLM: 0.15.1

  • Power mode: MAXN

  • Clocks locked (jetson_clocks)


Benchmark Setup

  • Input: ~2048 tokens

  • Output: ~128 tokens

  • Dataset: random synthetic prompts

  • Modes:

    • C1 (single request)

    • C8 (8 concurrent requests)

Each test was repeated until results stabilized (run-to-run variance within ~5%).
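For anyone reproducing this, the stop criterion can be sketched as a small helper. This is a hypothetical illustration (a simple max-min spread vs. mean check), not part of any benchmark harness:

```python
def is_stable(samples, tol=0.05):
    """Return True when run-to-run spread is within tol of the mean.

    `samples` are sustained tok/s values from consecutive benchmark runs.
    """
    if len(samples) < 2:
        return False
    mean = sum(samples) / len(samples)
    spread = max(samples) - min(samples)
    return spread <= tol * mean

# Example: two consecutive C8 runs from the results below
print(is_stable([266.4, 265.8]))  # → True (within 5%)
print(is_stable([150.4, 266.4]))  # → False (cold-start run still included)
```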


Results (stable runs)

Llama 3.1 8B (Dense)

  • C1: ~45 tok/s

  • C8: ~270 tok/s

This is consistent with or better than published Thor references.


Qwen 3 30B-A3B (MoE)

  • C1: ~34 tok/s (vs ~61 expected)

  • C8: ~96 tok/s (vs ~226 expected)


Mixtral-8x7B (MoE)

  • C1: ~7 tok/s

  • C8: ~14 tok/s

Performance remained consistently low across repeated runs.


Observations

  • Warmup alone was not sufficient — multiple runs were required for stabilization

  • Dense models scale well under concurrency

  • MoE models show reduced throughput and higher latency, especially under concurrency


Relevant Log Output

From server logs:

Using default MoE config. Performance might be sub-optimal!
Config file not found at:
.../fused_moe/configs/E=128,N=768,device_name=NVIDIA_Thor.json
Not enough SMs to use max_autotune_gemm mode
Using TRITON backend for Unquantized MoE
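For context, the missing file name in that warning is derived from the MoE shape (number of experts E, intermediate size N) and the CUDA device name. A rough sketch of the lookup, as an illustrative reconstruction rather than vLLM's exact code (real config names can also encode dtype and other fields):

```python
# Illustrative reconstruction of how the fused-MoE config filename
# in the log above is formed. Not vLLM's actual implementation.
def moe_config_filename(num_experts: int, intermediate_size: int,
                        device_name: str) -> str:
    device = device_name.replace(" ", "_")
    return f"E={num_experts},N={intermediate_size},device_name={device}.json"

print(moe_config_filename(128, 768, "NVIDIA Thor"))
# → E=128,N=768,device_name=NVIDIA_Thor.json
```

Since no such file ships for Thor, the Triton fused-MoE kernels fall back to a generic default tile configuration, which matches the "Performance might be sub-optimal!" warning.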

Question / Request for Guidance

Based on the above, I wanted to ask:

  1. Are Thor-specific fused MoE configuration files expected to be included in the vLLM container (26.02 or later)?

  2. Is Triton currently the intended backend for MoE workloads on Thor?

  3. Are there recommended runtime flags or tuning parameters for improving MoE performance on this platform?

  4. Are there known limitations or best practices when running MoE models on Thor vs Dense models?


Notes

I want to be careful not to draw incorrect conclusions — the intent here is to understand whether the observed MoE behavior is expected given the current software stack, or if there are configuration steps I may be missing.

Happy to provide additional logs or full benchmark details if helpful.


Thanks in advance for any guidance.

thor_vllm_benchmark_playbook_fixed_v3_with_results_add.docx (207.9 KB)

benchresult.txt (1.5 KB)

Hi,

Could you share the benchmark data you are referring to with us first?

In our Jetson AI Lab table, we don't have Mixtral-8x7B results, but there are entries for Qwen3.5 35B-A3B (MoE) and Qwen3 30B-A3B (MoE).

1.
We suppose not, but you can find the exact command we use for benchmarking in the link below:

For example, the command we use for Qwen MoE is:

$ sudo docker run -it --rm --pull always --runtime=nvidia --network host ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor vllm serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 --gpu-memory-utilization 0.8 --enable-prefix-caching --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder

2.
Usually, we use the vLLM container directly.

3.
Please check the above link to find the exact flags we use for each model.

4.
Please try running the model with the command shared in the above link.
For example, the perf results of the Qwen3 MoE model are:

  • Qwen3 30B-A3B (c=1): 67 tok/s
  • Qwen3 30B-A3B (c=8): 242 tok/s

This should match the expected performance.

Thanks.

Hello AastaLLL

Check the two attachments in the original thread.

I see a lot of useful info to try a rerun with:
the Jetson-specific container ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor and the options used in the test.

Regards

================================================================================================================================
  Results vs NVIDIA Published
================================================================================================================================
  Model        Run    C1 Sustained     NVIDIA C1    Delta  C8 Sustained     NVIDIA C8    Delta  C1 TTFT ms
  ------------  ------  ----------------  ------------  ------  ----------------  ------------  ------  ------------
  llama8       R1     33.92 tok/s      44.00        -22.9% 150.39 tok/s     244.00       -38.4% 756.23
  llama8       R2     45.19 tok/s      44.00        +2.7%  266.41 tok/s     244.00       +9.2%  39.58
  llama8       R3     45.33 tok/s      44.00        +3.0%  265.81 tok/s     244.00       +8.9%  39.54
  qwen30       R1     58.57 tok/s      67.00        -12.6% 176.98 tok/s     242.00       -26.9% 477.79
  qwen30       R2     78.63 tok/s      67.00        +17.4% 260.57 tok/s     242.00       +7.7%  42.20
  qwen30       R3     81.38 tok/s      67.00        +21.5% 262.26 tok/s     242.00       +8.4%  41.93
  qwen32       R1     10.11 tok/s      13.19        -23.4% 42.27 tok/s      79.10        -46.6% 2994.17
  qwen32       R2     13.13 tok/s      13.19        -0.5%  85.61 tok/s      79.10        +8.2%  103.92
  qwen32       R3     13.12 tok/s      13.19        -0.5%  85.78 tok/s      79.10        +8.4%  103.46
  qwen35       R1     28.49 tok/s      35.00        -18.6% 128.05 tok/s     125.00       +2.4%  471.93
  qwen35       R2     30.07 tok/s      35.00        -14.1% 151.99 tok/s     125.00       +21.6% 236.65
  qwen35       R3     30.06 tok/s      35.00        -14.1% 151.99 tok/s     125.00       +21.6% 236.71
  llama70      R1     4.87 tok/s       6.27         -22.3% 20.67 tok/s      41.50        -50.2% 6418.73
  llama70      R2     6.39 tok/s       6.27         +1.9%  44.49 tok/s      41.50        +7.2%  184.44
  llama70      R3     6.39 tok/s       6.27         +1.9%  44.52 tok/s      41.50        +7.3%  183.51
  gptoss120    R1     20.12 tok/s      ?.??         n/a    73.77 tok/s      ?.??         n/a    907.50
  gptoss120    R2     29.86 tok/s      ?.??         n/a    105.29 tok/s     ?.??         n/a    64.12
  gptoss120    R3     30.95 tok/s      ?.??         n/a    97.84 tok/s      ?.??         n/a    62.50
================================================================================================================================

Note: Qwen3.5 35B-A3B is a Mixture-of-Experts (MoE) model. Speculative Decoding with MTP does not work.

Summary

Multi-Token Prediction (MTP) speculative decoding fails to load on the NVFP4-quantized Qwen3.5-35B-A3B checkpoint published at Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 when using the official NVIDIA Jetson Thor vLLM container (ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor). The failure is caused by a weight namespace mismatch in the checkpoint's safetensors file: all weights are prefixed with 'language_model.' but vLLM's Qwen3_5MoeMTP model loader expects them without this prefix.

Additionally, the checkpoint appears to contain Mamba/SSM architecture weights (linear_attn, A_log, conv1d, dt_bias) rather than standard Qwen3.5 MoE attention weights, suggesting the safetensors file may be from an entirely different model.

This is particularly impactful because NVIDIA’s own Jetson AI Lab page ( Qwen3.5 35B-A3B (MoE) | Jetson AI Lab ) explicitly advertises MTP speculative decoding as a feature for this exact model/container combination, yet it cannot be made to work with the published checkpoint.

Hi,

Which command do you use?
Do you try the one shared in the tutorial?

sudo docker run -it --rm --pull always --runtime=nvidia --network host ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor vllm serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 --gpu-memory-utilization 0.8 --enable-prefix-caching --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder

For the failure, do you have the error message or log that you can share with us?

Thanks.

Hi AastaLLL,

Thank you for the guidance. I followed your recommendation to switch to the Jetson Thor container (ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor) and repeated the full benchmark suite using a 3-run stabilization protocol (warmup + 3 runs at C1 and C8, ISL/OSL 2048/128). The results improved dramatically over my original NGC container runs.


VALIDATED RESULTS (R3 — stable run, Jetson Thor container)

Qwen3 30B-A3B (W4A16)

  • C1: 81.38 tok/s (+21.5% vs your reference of 67 tok/s)
  • C8: 262.26 tok/s (+8.4% vs your reference of 242 tok/s)

Llama 3.1 8B (W4A16)

  • C1: 45.33 tok/s (+3.0% vs reference of 44 tok/s)
  • C8: 265.81 tok/s (+8.9% vs reference of 244 tok/s)

Qwen3 32B (W4A16)

  • C1: 13.12 tok/s (-0.5% vs reference of 13.19 tok/s)
  • C8: 85.78 tok/s (+8.4% vs reference of 79.1 tok/s)

GPT OSS 120B (NVFP4, Thor-exclusive)

  • C1: 30.95 tok/s (no published reference found)
  • C8: 97.84 tok/s (no published reference found)

Key finding on benchmark methodology: R1 (the first run after warmup) consistently underperforms by 12-23% due to cold-start effects (empty KV/prefix cache, GPU not yet at steady state). R2 and R3 stabilize and match or exceed published references. Single-run benchmarks on this platform will produce misleading results regardless of which container is used.
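The "discard R1, trust R2/R3" protocol can be expressed as a tiny sketch (hypothetical helper names; this is not part of any benchmark tool):

```python
def sustained_throughput(runs):
    """Report throughput with the cold-start run (R1) discarded.

    `runs` is a list of tok/s values in run order; R1 is treated as
    an extended warmup and excluded from the reported average.
    """
    if len(runs) < 3:
        raise ValueError("protocol expects at least 3 runs")
    stable = runs[1:]  # drop R1 (cold-start)
    return sum(stable) / len(stable)

# Qwen3 30B-A3B C8 runs from the table above
print(sustained_throughput([176.98, 260.57, 262.26]))  # average of R2 and R3
```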


QWEN3.5 35B-A3B NVFP4 CHECKPOINT ISSUE

During this process I identified two separate problems with the Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 checkpoint.

Issue 1 — Basic serve command runs but does not achieve published performance

The Jetson AI Lab command without MTP flags loads and serves without crashing:

vllm serve Kbenkhaled/Qwen3.5-35B-A3B-NVFP4 \
  --gpu-memory-utilization 0.8 --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder

However the performance does not match your published reference of 35 tok/s C1 / 125 tok/s C8.

Our R3 results with this checkpoint and command:

  • C1: 30.06 tok/s (-14.1% vs reference)
  • C8: 151.99 tok/s (+21.6% vs reference)

Server logs confirm bf16 fallback is occurring rather than true NVFP4 quantized inference. This is consistent with the weight namespace issue described below preventing correct NVFP4 loading.

Issue 2 — MTP speculative decoding causes hard crash

Adding the MTP flag as advertised on the Jetson AI Lab page:

--speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'

Causes a hard crash:

ValueError: There is no module or parameter named 'language_model'
in Qwen3_5MoeMTP.

Root cause: Inspection of the checkpoint’s safetensors file shows all 123,973 weight keys are prefixed with ‘language_model.’ but vLLM’s Qwen3_5MoeMTP loader expects them under the flat ‘model.’ namespace. The remap_weight_names() function in qwen3_5_mtp.py does not handle this prefix.

Additionally the checkpoint contains Mamba/SSM architecture weights (linear_attn, A_log, conv1d, dt_bias) that are inconsistent with the Qwen3.5 MoE attention architecture — suggesting the safetensors file may be from a different model entirely.
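Both checkpoint symptoms can be illustrated with a small key-inspection sketch. The helper names here are hypothetical; with a local checkpoint, the actual key list would come from `safetensors.safe_open(path, framework="pt").keys()`:

```python
# Hypothetical diagnostic for the two checkpoint issues described above.
MAMBA_MARKERS = ("linear_attn", "A_log", "conv1d", "dt_bias")

def strip_prefix(key: str) -> str:
    """Remap a 'language_model.'-prefixed key to the flat namespace."""
    return key.removeprefix("language_model.")

def diagnose_keys(keys):
    """Count keys with the unexpected prefix and Mamba/SSM-style names."""
    prefixed = sum(k.startswith("language_model.") for k in keys)
    mamba_like = sum(any(m in k for m in MAMBA_MARKERS) for k in keys)
    return {"prefixed": prefixed, "mamba_like": mamba_like}

sample = [
    "language_model.model.layers.0.linear_attn.A_log",
    "language_model.model.layers.0.mlp.experts.0.w1.weight",
]
print(diagnose_keys(sample))    # → {'prefixed': 2, 'mamba_like': 1}
print(strip_prefix(sample[1]))  # → model.layers.0.mlp.experts.0.w1.weight
```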

The Jetson AI Lab page explicitly advertises MTP speculative decoding for this exact checkpoint and container combination, but it cannot be made to work with the published weights.

Workaround: The unquantized base model (Qwen/Qwen3.5-35B-A3B) serves successfully and MTP loads without crashing. This trades NVFP4 quantization speed for a working MTP configuration. Our R3 results with the base model:

  • C1: 30.06 tok/s (bf16 fallback, expected to improve with correct NVFP4 checkpoint)
  • C8: 151.99 tok/s (+21.6% vs reference)

I have prepared a detailed bug report documenting the weight key structure, full traceback, and verification commands. Happy to share the full report or attach it here if useful to the NVIDIA team. I am also prepared to file issues at the Kbenkhaled HuggingFace repository and the jetson-containers GitHub if that would help get this resolved.

Thank you again for pointing me to the correct container path — the performance difference was dramatic and the methodology findings should be useful to other Thor users.

WayNo

qwen35_mtp_bug_report.docx (11.3 KB)

Hi,

Thanks for reporting this.

We are checking this issue with our internal team.
Will get back to you later.

Thanks.

Hi,

Thanks for reporting this.
Confirmed that we can see the same error when using Qwen3.5-35B-A3B-NVFP4 + MTP:

ValueError: There is no module or parameter named 'language_model' in Qwen3_5MoeMTP. 

We are now checking with the internal team for more information.
Will get back to you later.

Thanks.
