I was running the TRT-LLM for Inference playbook with the nvidia/Llama-3.3-70B-Instruct-FP4 model, and for comparison loaded meta/llama-3.3-70b Q4_K_M in LM Studio. TRT-LLM uses almost 90 GB of memory compared to 43 GB in LM Studio. Tokens per second also differ a lot: 4.6–4.9 for LM Studio versus 2.5 for TRT-LLM.
There's probably something wrong in my configuration. Could you help me figure out the difference?
To help us support you, can you give us more information on how you served the LLM and measured the performance? Can you share your scripts and commands?
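For reference, here is a minimal sketch of how decode throughput could be measured against an OpenAI-compatible endpoint (both LM Studio and `trtllm-serve` expose `/v1/chat/completions`). The URL, port, and model name are placeholders for your setup, and counting one streamed chunk as one token is only an approximation:

```python
# Rough decode-throughput check against an OpenAI-compatible endpoint.
# The URL, port, and model name are assumptions -- adjust for your setup.
import json
import time
import urllib.request

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Decode throughput over the measured window."""
    return n_tokens / elapsed_s if elapsed_s > 0 else 0.0

def measure(url: str, model: str, prompt: str, max_tokens: int = 128) -> float:
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    n, t_first, t_last = 0, None, None
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode().strip()
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            chunk = json.loads(line[len("data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                now = time.perf_counter()
                t_first = t_first or now
                t_last = now
                n += 1  # one streamed chunk ~ one token, for rough numbers
    # Exclude time-to-first-token so prefill doesn't skew the decode rate.
    return tokens_per_second(n - 1, t_last - t_first)

# Usage (with a server running locally):
#   measure("http://localhost:8000/v1/chat/completions",
#           "llama-3.3-70b", "Explain KV cache in one paragraph.")
```

Excluding the time to first token matters here: otherwise a slow prefill makes the decode rate look worse than it is, and the two engines may differ in prefill for unrelated reasons.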
I’m experiencing the same thing. The scripts and commands I’m using are from the tutorial here:
Any solution to this? It’s actually pretty slow.
Essentially, instead of using the 8B model I swapped it out for the 70B.
Pinging this channel again for support on this: Llama 70B is pretty slow with TRT-LLM + NVFP4, which I would have assumed to be faster?
Token generation (decode) speed is memory-bandwidth bound, so quant type (not size) won’t make any noticeable difference there, but it can speed up prefill/prompt processing.
Having said that, I’ve yet to see any of the other solutions beat llama.cpp in token generation speed for a single request. That includes vllm, sglang, trt-llm.
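A back-of-envelope calculation shows why decode is memory-bound: each generated token streams (roughly) the full weight set through memory once, so tokens/s is capped at bandwidth divided by weight bytes. The numbers below are assumptions, a ~273 GB/s memory bandwidth (the published DGX Spark figure) and ~4.5 bits/weight for a Q4_K_M-style quant:

```python
# Back-of-envelope decode ceiling: each new token reads (roughly) all
# weights from memory once, so tokens/s <= bandwidth / weight_bytes.
# Both inputs below are assumptions -- plug in your own hardware specs.

def decode_ceiling_tps(mem_bandwidth_gbs: float, weight_gb: float) -> float:
    return mem_bandwidth_gbs / weight_gb

# ~70B params at ~4.5 bits/weight (Q4_K_M-ish) is on the order of 40 GB.
weights_gb = 70e9 * 4.5 / 8 / 1e9  # = 39.375 GB
print(decode_ceiling_tps(273.0, weights_gb))  # ~6.9 tok/s upper bound
```

On those assumptions the ceiling is about 6.9 tok/s, so the 4.6–4.9 reported for llama.cpp/LM Studio is already a decent fraction of it, and the 2.5 from TRT-LLM points at overhead rather than the quant.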
I was able to get some speedups here, reaching 12 tok/s when using speculative decoding with TensorRT-LLM. I’ll keep digging a bit to see if we can push this any further.
Hey mate, anything new on this, or is it still the same? I thought it was meant to give us a bit faster inference speeds as well as help with KV-cache loading. I’m a newb though, big time LOL
Well, vLLM has gotten closer to llama.cpp speeds in inference, and you can improve speeds further when you have a cluster. Do you have any specific question in mind?
Is NVFP4 not fully optimized yet for us to get the fastest speeds? I was looking into some of the NIMs and ComfyUI stuff, and I’m looking to build some agents and agentic workflows locally with my Spark! Interested in what people have found is the best to use and utilize!! :)
Yes, NVFP4 is not fully optimized yet, but it’s getting there. As of now, the best quants are FP8 (if the model is not too large and the weights fit in memory), AWQ, and INT4 AutoRound.