I’ve been testing the community versions of MLC and llama.cpp on Jetson Thor, and noticed a significant performance gap during prefill. I’d like to check if this is expected behavior or if there are optimization options I might have missed.
Model setup: Qwen3-30B-A3B (tested with Q4_K_M and q4bf16_1/q4f16_1 quantized variants)
Hardware: Jetson Thor
Performance comparison:
llama.cpp: Prefill for ~10k tokens takes about 30–40 seconds
MLC: The same setup takes roughly 1.5× as long for prefill
On the other hand, MLC seems much faster during decode (roughly 3× faster than llama.cpp)
Stability:
llama.cpp server sometimes reports illegal memory access errors
MLC is more stable, but the prefill speed gap is quite large
Additional note: As far as I can tell, MLC does not support FP8 activations yet.
My questions are:
Is this prefill slowdown mainly due to MLC’s framework design, or lack of optimization/adaptation for Jetson Thor?
Are there recommended build flags, runtime parameters, or configuration tweaks to improve prefill performance?
If there are known issues or a roadmap for improvements, I’d really appreciate any pointers.
Yes—those numbers line up with what we see. With Qwen3-30B + q4bf16 on the simple sample, MLC’s short-completion decode throughput (~72 tok/s for 500 tokens in ~6.9 s) is roughly 3× our llama.cpp baseline on the same setup.
However, for long-context, prefill-dominated workloads (e.g., summarizing ~10k-token inputs), MLC is noticeably slower end-to-end on our side: ~2+ minutes for the job, whereas llama.cpp finishes in ~40 seconds under the same settings.
Thanks for checking. Yes, we’ve tried Qwen Coder 30B on a Jetson Orin 64GB side by side with MLC and llama.cpp, and our results match the report you referenced.
Our replication (single request, ~7,000-token input; short completion):
llama.cpp (q4_k_m): 23–26 s to first token
MLC-LLM (q4f16_1): 45–52 s to first token
That works out to roughly 2× slower prefill for MLC on this workload.
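For anyone who wants to reproduce the first-token numbers, here is a minimal timing sketch against an OpenAI-compatible streaming endpoint (both llama.cpp’s llama-server and `mlc_llm serve` expose one). The URL, model name, and prompt file below are placeholders, not our exact setup.

```python
# Rough timing sketch: time to first streamed token (a proxy for prefill) and
# decode chunks/sec against an OpenAI-compatible /v1/chat/completions endpoint.
# URL, model name, and prompt file are placeholders for your own setup.
import json, time, requests

URL = "http://localhost:8000/v1/chat/completions"  # llama-server or mlc_llm serve
prompt = open("long_input_7k_tokens.txt").read()   # hypothetical ~7k-token input

payload = {
    "model": "Qwen3-30B-A3B",  # whatever name your server registered
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 200,
    "stream": True,
}

start = time.time()
first_token_at = None
chunks = 0
with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
    for line in resp.iter_lines():
        # SSE lines look like: b'data: {...}' with a final b'data: [DONE]'
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        if delta.get("content"):
            chunks += 1
            if first_token_at is None:
                first_token_at = time.time()
end = time.time()

print(f"time to first token: {first_token_at - start:.1f} s")
if chunks > 1:
    print(f"decode rate: {(chunks - 1) / (end - first_token_at):.1f} chunks/s (~tok/s)")
```

Chunk count is only a proxy for token count (most servers stream one token per chunk), but it cleanly separates the prefill phase from decode, which the end-to-end timings above mix together.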
Per the GitHub issue, we also tried prefill_chunk_size=4096 (vs. 2048 default). The improvement was marginal on our Orin setup.
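For reference, a quick way to change that value is to edit the prefill_chunk_size field in the generated mlc-chat-config.json. The path below is only an example, and depending on the MLC build the compiled model library may cap how high the value can go, so treat this as a sketch rather than a guaranteed knob.

```python
# Sketch: bump MLC's prefill_chunk_size in the generated chat config.
# The model directory is an example path; point it at your own compiled model.
import json
from pathlib import Path

cfg_path = Path("dist/Qwen3-30B-A3B-q4f16_1-MLC/mlc-chat-config.json")  # example path
cfg = json.loads(cfg_path.read_text())

print("current prefill_chunk_size:", cfg.get("prefill_chunk_size"))  # 2048 default here
cfg["prefill_chunk_size"] = 4096  # the value we tried per the GitHub issue
cfg_path.write_text(json.dumps(cfg, indent=2))
```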
Does vLLM work for you? It seems vLLM also supports FP8 activations.
We have an official vLLM container (from NGC) and can reach 226.42 tokens/s with it, though we don’t have prefill-time data yet.
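Since prefill time is the missing piece, here is a rough way to split it out with vLLM’s offline Python API (untested on Thor): a max_tokens=1 run is dominated by prefill and approximates time to first token, and a longer run on the same prompt gives a decode estimate. The model id, prompt, and token counts below are placeholders.

```python
# Rough prefill/decode split using vLLM's offline API (untested on Jetson Thor;
# the model id, prompt length, and token counts are placeholders).
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-30B-A3B")  # adjust to the checkpoint used in the container

long_prompt = "lorem " * 7000  # stand-in for a ~10k-token document

# 1) max_tokens=1 is dominated by prefill: a rough time-to-first-token estimate.
t0 = time.time()
llm.generate([long_prompt], SamplingParams(temperature=0.0, max_tokens=1))
prefill_s = time.time() - t0
print(f"~prefill: {prefill_s:.1f} s")

# 2) A longer completion on the same prompt; the extra time is mostly decode.
#    (If prefix caching is enabled, the second prefill may be partly skipped,
#    so treat the decode figure as a rough estimate.)
n_new = 500
t0 = time.time()
llm.generate([long_prompt], SamplingParams(temperature=0.0, max_tokens=n_new, ignore_eos=True))
total_s = time.time() - t0
print(f"~decode: {n_new / max(total_s - prefill_s, 1e-3):.1f} tok/s")
```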
Thanks a lot! I haven’t tried vLLM on Thor yet; previously I didn’t see an official image. The NVIDIA Google Drive in the link you shared says I need permission. Could you please approve access?
If vLLM on Thor can reach ~226 tok/s decode with reasonable prefill speed, that’s perfectly acceptable for our current needs. We only have one Thor device on hand (more are on the way). Once the additional machines arrive, I’ll run tests, do a more detailed system analysis, and share the findings and any improvement strategies here.