Prefill speed on MLC significantly slower than llama.cpp on Jetson Thor – any optimization suggestions?

Fry_Song · August 29, 2025, 10:40am

Hi everyone,

I’ve been testing the community versions of MLC and llama.cpp on Jetson Thor, and noticed a significant performance gap during prefill. I’d like to check if this is expected behavior or if there are optimization options I might have missed.

Model setup: Qwen3-30B-A3B (tested with Q4_K_M and q4bf16_1/q4f16_1 quantized variants)
Hardware: Jetson Thor
Performance comparison:
- llama.cpp: Prefill for ~10k tokens takes about 30–40 seconds
- MLC: Same setup requires 1.5× longer for prefill
- On the other hand, MLC seems much faster during decode (roughly 3× faster than llama.cpp)
Stability:
- llama.cpp server sometimes reports illegal memory access errors
- MLC is more stable, but the prefill speed gap is quite large

Additional note: As far as I can tell, MLC currently does not support FP8 activation yet.

My questions are:

Is this prefill slowdown mainly due to MLC’s framework design, or lack of optimization/adaptation for Jetson Thor?
Are there recommended build flags, runtime parameters, or configuration tweaks to improve prefill performance?
If there are known issues or a roadmap for improvements, I’d really appreciate any pointers.

Thanks a lot!

AastaLLL · September 1, 2025, 2:30am

Hi,

Could you share the steps you set up llama.cpp and MLC with us?
We will try to reproduce this issue internally and provide more info to you later.

Thanks

Fry_Song · September 1, 2025, 3:09am

Hi, thanks for the quick response!

Here are the steps I used to set up both llama.cpp and MLC on Jetson Thor:

MLC:
model:

Run container


sudo docker run -it --rm \
  --runtime nvidia \
  --gpus all \
  -v /workspace:/workspace \
  -p 6678:6678 \
  -p 6677:6677 \
  ghcr.io/nvidia-ai-iot/mlc:r38.2.arm64-sbsa-cu130-24.04 \

Convert weight:

mlc_llm convert_weight /workspace/models/Qwen3-30B-A3B-Instruct-2507/ \
    --quantization q4bf16_1 \
    --model-type qwen3_moe\ 
    --device cuda \ 
    --source-format huggingface-safetensor\
    -o /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4f16_1

gen_config:

mlc_llm gen_config \
    /workspace/models/Qwen3-30B-A3B-Instruct-2507/\
    --quantization q4bf16_1 \
    --conv-template qwen2\
    --context-window-size 32768\
    --prefill-chunk-size 4096 \
    --max-batch-size 3\
    --output /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1

compile:

mlc_llm compile /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1/mlc-chat-config.json \
    --device cuda \
    -o /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1/Qwen3-30B-A3B-Instruct-2507-q4bf16_1-cuda.so \
    --quantization q4f16_1 \
    --model-type qwen3_moe \
    --opt="cublas_gemm=1;cudagraph=1" \

Serve

mlc_llm serve /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1 \
  --port 6678 \
  --host 0.0.0.0 \
  --device cuda \
  --mode interactive \
  --model-lib /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1/Qwen3-30B-A3B-Instruct-2507-q4bf16_1-cuda.so \
  --overrides "max_num_sequence=1;max_total_seq_length=32768;context_window_size=32768;gpu_memory_utilization=0.3"

Llama.cpp
model:

Run container

cd /workspace/llama.cpp/build/bin
./llama-server \
    -m /workspace/models/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-Q4_K_M.gguf\
    --host "0.0.0.0" \
    --port 6678 \
    -ngl 99 \
    -c 32768

Serve

llama-server \
    -m /workspace/models/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-Q4_K_M.gguf\
    --host "0.0.0.0" \
    --port 6678 \
    -ngl 99 \
    -c 32768

Thanks!

AastaLLL · September 2, 2025, 7:32am

Hi,

Thanks for sharing the details.

We have set up MLC and llama.cpp on Thor locally and are now downloading the model.
Will update the status from our side with you later.

Thanks.

Fry_Song · September 2, 2025, 7:47am

Hi, thanks for the update! Looking forward to your findings.

AastaLLL · September 3, 2025, 9:42am

Hi,

Do you have the client app that can measure the performance as well as the prefill time for MLC and llama.cpp?
If so, could you also share it with us?

We have converted the model into MLC and want to test it with the same app you used.
Thanks.

Fry_Song · September 3, 2025, 9:51am

Hi, thanks for the follow-up.

Our user scenario is mainly 1–3 concurrent personal agent sessions, where the common workload is either:

generating a ~2000-word story, or

summarizing a long document of around 10k tokens.

That’s the context in which we’ve been measuring performance (prefill time and throughput) for both MLC and llama.cpp.

AastaLLL · September 4, 2025, 9:09am

Hi,

We test MLC with the simple sample and can get below performance with Qwen3-30B+q4bf16.

Elapsed time: 6.94 sec
Completion tokens: 500
Approx. tokens/sec: 72.04

Is this the same as your testing?
Thanks.

Fry_Song · September 4, 2025, 9:22am

Hi,

Yes—those numbers line up with what we see. With Qwen3-30B + q4bf16 on the simple sample, MLC’s short-completion decode throughput (~72 tok/s for 500 tokens in ~6.9 s) is roughly 3× our llama.cpp baseline on the same setup.

However, for long-context, prefill-dominated workloads (e.g., summarizing ~10k-token inputs), MLC is noticeably slower end-to-end on our side: ~2+ minutes for the job, whereas llama.cpp finishes in ~40 seconds under the same settings.

Thanks for the quick follow-up —much appreciated!

AastaLLL · September 5, 2025, 5:10am

Hi,

Have you tried other models or on other devices (ex., Orin or x86)?
It seems that the Qwen model runs slower on the MLC:

github.com/mlc-ai/mlc-llm

[Speed] MLC is much slower than Ollama when running Qwen Coder 30b

opened 03:54AM - 21 Aug 25 UTC

capyun

performance

# 🏎️ Speed Report - The model code: Qwen Coder 30b, and I use q4f16_1 quant…ization. - The model configuration (e.g. quantization mode, running data type, etc.): q4f16_1, q4f16_ft. - Device (e.g. MacBook Pro M2, PC+RTX 3080): orin 64GB - OS (if applicable): - Encode speed (Token/s): - Decode speed (Token/s): approximately 6, and ollama is 20, the prompt len is 1024*16, - Memory usage (if applicable): The actual memory used is less than estimated.

Thanks.

Fry_Song · September 8, 2025, 3:32am

Hi,

Thanks for checking. Yes—we’ve tried Qwen Coder 30B on a Jetson Orin 64GB side-by-side with MLC and llama.cpp, and our results match the report you referenced.

Our replication (single request, ~7,000-token input; short completion):

llama.cpp (q4_k_m): 23–26 s first token **
MLC-LLM (q4f16_1): 45–52 s first token **

That’s roughly a 2× slower prefill runtime for MLC on this workload.
Per the GitHub issue, we also tried prefill_chunk_size=4096 (vs. 2048 default). The improvement was marginal on our Orin setup.

Thanks!

AastaLLL · September 9, 2025, 2:41pm

Hi,

Does the vLLM work for you? It seems that vLLM also supports FP8 activation.
We have an official vLLM container (from NGC) and can reach 226.42 tokens/sec. (But we don’t have the prefill time data)

Thanks.

Fry_Song · September 9, 2025, 2:59pm

Hi,

Thanks a lot! I haven’t tried vLLM on Thor yet—previously I didn’t see an official image. In the link you shared, the NVIDIA Google Drive says I need permission. Could you please approve access?

If vLLM on Thor can reach ~226.42 tok/s decode and a reasonable prefill speed, that’s perfectly acceptable for our current needs. We only have one Thor device on hand (more are on the way). Once I get the additional machines —I’ll run tests, perform a detailed system analysis, and share the findings and improvement strategies here.

Much appreciated!

snomile · September 9, 2025, 5:24pm

do you publish the image on NGC? the only vllm image in NGC is for x86/64, not for Thor

AastaLLL · September 15, 2025, 1:07pm

Hi, both

We currently share the container as a file.
You can find the thor_vllm_container.tar.gz on the page.

On NGC, please check nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3.
We will separate the container for Thor in the near future.

Thanks.

Topic		Replies	Views
performance issue CUDA Programming and Performance	21	10371	April 30, 2007
Tesla S2050 performance double precision performance too low CUDA Programming and Performance	42	29253	December 8, 2010
Tesla C2050 (Fermi) benchmarking results CUDA Programming and Performance	18	8695	September 22, 2010
CUDA setup times (create context, malloc, destroy context) some measurements included CUDA Programming and Performance	19	23200	July 8, 2011
How to optimize this code (whirlpool hash)? Independant parallel hashing with whirlpool is way too s CUDA Programming and Performance	28	3081	September 14, 2010
From low end GPUs to high end GPUs Moving from 9600GT to Tesla T10 provides no improvement, why ? CUDA Programming and Performance	24	17361	June 8, 2010
Pytorch with jetpack 4.2 works slowly than 3.3 Jetson TX2	6	1383	October 18, 2021
LLaMa 2 LLMs w/ NVIDIA Jetson and textgeneration-web-ui Jetson Projects generative_ai	86	24828	May 10, 2024
tensorflow performence improvements is not linear before/after jetson_clocks.sh Jetson TX2	4	605	October 18, 2021
slower when change DefaultDeviceType from GPU to DLA? Jetson AGX Xavier	3	669	October 18, 2021

Prefill speed on MLC significantly slower than llama.cpp on Jetson Thor – any optimization suggestions?

Related topics