Hi all,
I’ve been working through benchmarking LLM inference on Jetson AGX Thor and wanted to share results and ask for guidance, particularly around MoE workloads.
Environment
- Jetson AGX Thor Dev Kit
- Ubuntu 24.04.4 LTS
- Kernel: 6.8.12-tegra
- CUDA: 13.0
- Driver: 580
- Container: nvcr.io/nvidia/vllm:26.02-py3
- vLLM: 0.15.1
- Power mode: MAXN
- Clocks locked (jetson_clocks)
Benchmark Setup
- Input: ~2048 tokens
- Output: ~128 tokens
- Dataset: random synthetic prompts
- Modes:
  - C1 (single request)
  - C8 (8 concurrent requests)
- Each test was repeated until results stabilized (~5% run-to-run variance)
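For anyone reproducing this, the "repeat until ~5% variance" stopping rule can be sketched as a small helper. This is my own convenience function (the name `stabilized` and the 3-run window are assumptions, not part of any benchmark tool):

```python
from statistics import mean, pstdev

def stabilized(samples: list[float], window: int = 3, threshold: float = 0.05) -> bool:
    """Return True once the last `window` throughput samples (tok/s)
    vary by less than `threshold` relative to their mean."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    m = mean(recent)
    # Relative spread: population stddev over the window mean.
    return m > 0 and pstdev(recent) / m < threshold

# Example: append tok/s readings after each run until stable.
runs = [41.2, 44.8, 45.1, 44.9, 45.3]
print(stabilized(runs))  # last three runs are within ~5% of each other
```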
Results (stable runs)
Llama 3.1 8B (Dense)
- C1: ~45 tok/s
- C8: ~270 tok/s
This is consistent with or better than published Thor references.
Qwen 3 30B-A3B (MoE)
- C1: ~34 tok/s (vs ~61 expected)
- C8: ~96 tok/s (vs ~226 expected)
Mixtral-8x7B (MoE)
- C1: ~7 tok/s
- C8: ~14 tok/s
Performance remained consistently low across repeated runs.
Observations
- Warmup alone was not sufficient; multiple runs were required for stabilization
- Dense models scale well under concurrency
- MoE models show reduced throughput and higher latency, especially under concurrency
Relevant Log Output
From server logs:
Using default MoE config. Performance might be sub-optimal!
Config file not found at:
.../fused_moe/configs/E=128,N=768,device_name=NVIDIA_Thor.json
Not enough SMs to use max_autotune_gemm mode
Using TRITON backend for Unquantized MoE
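In case it is useful context for the question below: to my understanding, the missing file named in that warning follows vLLM's fused-MoE tuning format, a JSON object keyed by token batch size, where each entry holds Triton kernel tile parameters. A minimal sketch of the file shape is below; the numeric values are placeholders I made up, not tuned for Thor (real entries come from running vLLM's `benchmarks/kernels/benchmark_moe.py` tuning sweep on the target device):

```python
import json

# Placeholder, untuned values -- illustrating the file format only.
moe_config = {
    # keys are token batch sizes; values are Triton fused-MoE kernel params
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 3},
}

# The file name encodes expert count (E), shard intermediate size (N),
# and the device name reported by the runtime.
path = "E=128,N=768,device_name=NVIDIA_Thor.json"
with open(path, "w") as f:
    json.dump(moe_config, f, indent=2)
```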
Question / Request for Guidance
Based on the above, I wanted to ask:
- Are Thor-specific fused MoE configuration files expected to be included in the vLLM container (26.02 or later)?
- Is Triton currently the intended backend for MoE workloads on Thor?
- Are there recommended runtime flags or tuning parameters for improving MoE performance on this platform?
- Are there known limitations or best practices when running MoE models on Thor versus dense models?
Notes
I want to be careful not to draw incorrect conclusions. The intent here is to understand whether the observed MoE behavior is expected given the current software stack, or whether there are configuration steps I may be missing.
Happy to provide additional logs or full benchmark details if helpful.
Thanks in advance for any guidance.
thor_vllm_benchmark_playbook_fixed_v3_with_results_add.docx (207.9 KB)
benchresult.txt (1.5 KB)