Prefill speed on MLC significantly slower than llama.cpp on Jetson Thor – any optimization suggestions?

Hi everyone,

I’ve been testing the community versions of MLC and llama.cpp on Jetson Thor, and noticed a significant performance gap during prefill. I’d like to check if this is expected behavior or if there are optimization options I might have missed.

  • Model setup: Qwen3-30B-A3B (tested with Q4_K_M and q4bf16_1/q4f16_1 quantized variants)

  • Hardware: Jetson Thor

  • Performance comparison:

    • llama.cpp: Prefill for ~10k tokens takes about 30–40 seconds

    • MLC: Same setup requires 1.5× longer for prefill

    • On the other hand, MLC seems much faster during decode (roughly 3× faster than llama.cpp)

  • Stability:

    • llama.cpp server sometimes reports illegal memory access errors

    • MLC is more stable, but the prefill speed gap is quite large

Additional note: As far as I can tell, MLC does not currently support FP8 activation.

My questions are:

  1. Is this prefill slowdown mainly due to MLC’s framework design, or lack of optimization/adaptation for Jetson Thor?

  2. Are there recommended build flags, runtime parameters, or configuration tweaks to improve prefill performance?

  3. If there are known issues or a roadmap for improvements, I’d really appreciate any pointers.

Thanks a lot!

Hi,

Could you share the steps you used to set up llama.cpp and MLC with us?
We will try to reproduce this issue internally and provide more info to you later.

Thanks

Hi, thanks for the quick response!

Here are the steps I used to set up both llama.cpp and MLC on Jetson Thor:

MLC:

Run container:


sudo docker run -it --rm \
  --runtime nvidia \
  --gpus all \
  -v /workspace:/workspace \
  -p 6678:6678 \
  -p 6677:6677 \
  ghcr.io/nvidia-ai-iot/mlc:r38.2.arm64-sbsa-cu130-24.04

Convert weight:

mlc_llm convert_weight /workspace/models/Qwen3-30B-A3B-Instruct-2507/ \
    --quantization q4bf16_1 \
    --model-type qwen3_moe \
    --device cuda \
    --source-format huggingface-safetensor \
    -o /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1

gen_config:

mlc_llm gen_config \
    /workspace/models/Qwen3-30B-A3B-Instruct-2507/ \
    --quantization q4bf16_1 \
    --conv-template qwen2 \
    --context-window-size 32768 \
    --prefill-chunk-size 4096 \
    --max-batch-size 3 \
    --output /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1

compile:

mlc_llm compile /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1/mlc-chat-config.json \
    --device cuda \
    -o /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1/Qwen3-30B-A3B-Instruct-2507-q4bf16_1-cuda.so \
    --quantization q4bf16_1 \
    --model-type qwen3_moe \
    --opt="cublas_gemm=1;cudagraph=1"

Serve:

mlc_llm serve /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1 \
  --port 6678 \
  --host 0.0.0.0 \
  --device cuda \
  --mode interactive \
  --model-lib /workspace/models/mlc/Qwen3-30B-A3B-Instruct-2507-q4bf16_1/Qwen3-30B-A3B-Instruct-2507-q4bf16_1-cuda.so \
  --overrides "max_num_sequence=1;max_total_seq_length=32768;context_window_size=32768;gpu_memory_utilization=0.3"

llama.cpp:

Serve:

cd /workspace/llama.cpp/build/bin
./llama-server \
    -m /workspace/models/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-Q4_K_M.gguf \
    --host "0.0.0.0" \
    --port 6678 \
    -ngl 99 \
    -c 32768
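
On the llama.cpp side I also cross-check against the server's own timing report rather than relying only on client-side timing. A minimal sketch, assuming llama-server's native /completion endpoint and the timings object it returns (fields such as prompt_ms and predicted_per_second; names can differ between builds):

# Read llama-server's own timing report for a single request.
# Assumes the native /completion endpoint and its "timings" field
# (prompt_ms, predicted_per_second, ...); field names may vary by version.
import requests

resp = requests.post(
    "http://localhost:6678/completion",
    json={"prompt": "Summarize: " + "lorem ipsum " * 2000, "n_predict": 128},
    timeout=600,
)
resp.raise_for_status()
timings = resp.json().get("timings", {})
print("prompt (prefill) ms:", timings.get("prompt_ms"))
print("decode tok/s:", timings.get("predicted_per_second"))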

Thanks!

Hi,

Thanks for sharing the details.

We have set up MLC and llama.cpp on Thor locally and are now downloading the model.
We will share an update from our side later.

Thanks.

Hi, thanks for the update! Looking forward to your findings.

Hi,

Do you have a client app that can measure the performance, including the prefill time, for MLC and llama.cpp?
If so, could you also share it with us?

We have converted the model into MLC and want to test it with the same app you used.
Thanks.

Hi, thanks for the follow-up.

Our user scenario is mainly 1–3 concurrent personal agent sessions, where the common workload is either:

  • generating a ~2000-word story, or

  • summarizing a long document of around 10k tokens.

That’s the context in which we’ve been measuring performance (prefill time and throughput) for both MLC and llama.cpp.
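
For reference, below is a stripped-down version of the client-side timing we use. It is only a sketch: it assumes both servers expose OpenAI-compatible streaming /v1/chat/completions (plus /v1/models) on the ports from the commands above, uses time-to-first-token as a proxy for prefill time, and counts streamed chunks as a rough stand-in for decode tokens:

# Rough client-side measurement of prefill (time to first streamed token)
# and decode throughput against an OpenAI-compatible endpoint.
# Assumes /v1/models and streaming /v1/chat/completions; the chunk count is
# only an approximation of the completion token count.
import json
import time

import requests

BASE = "http://localhost:6678/v1"  # MLC or llama.cpp server
model_id = requests.get(f"{BASE}/models", timeout=30).json()["data"][0]["id"]
prompt = "Summarize the following document:\n" + "lorem ipsum " * 3000  # ~10k tokens

start = time.time()
first_token_at = None
chunks = 0

with requests.post(
    f"{BASE}/chat/completions",
    json={
        "model": model_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "stream": True,
    },
    stream=True,
    timeout=1200,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: ") or line == b"data: [DONE]":
            continue
        chunk = json.loads(line[len(b"data: "):])
        if not chunk.get("choices"):
            continue
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            if first_token_at is None:
                first_token_at = time.time()
            chunks += 1

end = time.time()
if first_token_at is not None:
    print(f"prefill (TTFT): {first_token_at - start:.1f} s")
    print(f"decode: {chunks / (end - first_token_at):.1f} tok/s")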

Hi,

We tested MLC with a simple sample and got the performance below with Qwen3-30B + q4bf16.

Elapsed time: 6.94 sec
Completion tokens: 500
Approx. tokens/sec: 72.04

Is this the same as your testing?
Thanks.

Hi,

Yes—those numbers line up with what we see. With Qwen3-30B + q4bf16 on the simple sample, MLC’s short-completion decode throughput (~72 tok/s for 500 tokens in ~6.9 s) is roughly 3× our llama.cpp baseline on the same setup.

However, for long-context, prefill-dominated workloads (e.g., summarizing ~10k-token inputs), MLC is noticeably slower end-to-end on our side: ~2+ minutes for the job, whereas llama.cpp finishes in ~40 seconds under the same settings.

Thanks for the quick follow-up, much appreciated!

Hi,

Have you tried other models or other devices (e.g., Orin or x86)?
It seems that the Qwen model runs slower on MLC:

Thanks.

Hi,

Thanks for checking. Yes—we’ve tried Qwen Coder 30B on a Jetson Orin 64GB side-by-side with MLC and llama.cpp, and our results match the report you referenced.

Our replication (single request, ~7,000-token input; short completion):

  • llama.cpp (q4_k_m): 23–26 s to first token
  • MLC-LLM (q4f16_1): 45–52 s to first token

That’s roughly 2× slower prefill for MLC on this workload.
Per the GitHub issue, we also tried prefill_chunk_size=4096 (vs. 2048 default). The improvement was marginal on our Orin setup.

Thanks!

Hi,

Does vLLM work for you? It seems that vLLM also supports FP8 activation.
We have an official vLLM container (from NGC) and can reach 226.42 tokens/sec (but we don’t have prefill time data).

Thanks.

Hi,

Thanks a lot! I haven’t tried vLLM on Thor yet—previously I didn’t see an official image. In the link you shared, the NVIDIA Google Drive says I need permission. Could you please approve access?

If vLLM on Thor can reach ~226.42 tok/s decode and a reasonable prefill speed, that’s perfectly acceptable for our current needs. We only have one Thor device on hand (more are on the way). Once I get the additional machines, I’ll run tests, perform a detailed system analysis, and share the findings and improvement strategies here.

Much appreciated!

Do you publish the image on NGC? The only vLLM image on NGC is for x86_64, not for Thor.

Hi both,

We currently share the container as a file.
You can find the thor_vllm_container.tar.gz on the page.

On NGC, please check nvcr.io/nvidia/tritonserver:25.08-vllm-python-py3.
We will provide a separate container for Thor in the near future.

Thanks.
