I was running the TRT-LLM for Inference playbook with the nvidia/Llama-3.3-70B-Instruct-FP4 model, and for comparison loaded meta/llama-3.3-70b Q4_K_M in LM Studio. TRT-LLM uses almost 90 GB of memory compared to 43 GB in LM Studio. Tokens per second also differ a lot: 4.6–4.9 for LM Studio versus 2.5 for TRT-LLM.
There's probably something wrong in my configuration. Could you help me figure out the difference?
To help us support you, can you give us more information on how you served the LLM and measured the performance? Can you share your scripts and commands?
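For reference, here is a minimal sketch of how decode throughput could be measured against an OpenAI-compatible endpoint (both LM Studio and `trtllm-serve` expose `/v1/chat/completions`). The URL, port, and model name are placeholders for your setup, and counting one streamed chunk as one token is only an approximation:

```python
# Rough decode-throughput check against an OpenAI-compatible endpoint.
# The URL, port, and model name are assumptions -- adjust for your setup.
import json
import time
import urllib.request

def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Decode throughput over the measured window."""
    return n_tokens / elapsed_s if elapsed_s > 0 else 0.0

def measure(url: str, model: str, prompt: str, max_tokens: int = 128) -> float:
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    n, t_first, t_last = 0, None, None
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode().strip()
            if not line.startswith("data: ") or line == "data: [DONE]":
                continue
            chunk = json.loads(line[len("data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                now = time.perf_counter()
                t_first = t_first or now
                t_last = now
                n += 1  # one streamed chunk ~ one token, for rough numbers
    # Exclude time-to-first-token so prefill doesn't skew the decode rate.
    return tokens_per_second(n - 1, t_last - t_first)

# Usage (with a server running locally):
#   measure("http://localhost:8000/v1/chat/completions",
#           "llama-3.3-70b", "Explain KV cache in one paragraph.")
```

Excluding the time to first token matters here: otherwise a slow prefill makes the decode rate look worse than it is, and the two engines may differ in prefill for unrelated reasons.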
I’m experiencing the same thing. The scripts and commands I’m using are from the tutorial here:
Any solution to this? It’s actually pretty slow.
Essentially, instead of using the 8B model I swapped it out for the 70B.
Pinging this channel again for support on this: Llama 70B is pretty slow with TRT-LLM + NVFP4, which I would have assumed to be faster?
Token generation (decode) speed is memory-bandwidth bound, so quant type (not size) won’t make any noticeable difference there, but it can speed up prefill/prompt processing.
Having said that, I’ve yet to see any of the other solutions beat llama.cpp in token generation speed for a single request. That includes vllm, sglang, trt-llm.
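A back-of-envelope calculation shows why decode is memory-bound: each generated token streams (roughly) the full weight set through memory once, so tokens/s is capped at bandwidth divided by weight bytes. The numbers below are assumptions, a ~273 GB/s memory bandwidth (the published DGX Spark figure) and ~4.5 bits/weight for a Q4_K_M-style quant:

```python
# Back-of-envelope decode ceiling: each new token reads (roughly) all
# weights from memory once, so tokens/s <= bandwidth / weight_bytes.
# Both inputs below are assumptions -- plug in your own hardware specs.

def decode_ceiling_tps(mem_bandwidth_gbs: float, weight_gb: float) -> float:
    return mem_bandwidth_gbs / weight_gb

# ~70B params at ~4.5 bits/weight (Q4_K_M-ish) is on the order of 40 GB.
weights_gb = 70e9 * 4.5 / 8 / 1e9  # = 39.375 GB
print(decode_ceiling_tps(273.0, weights_gb))  # ~6.9 tok/s upper bound
```

On those assumptions the ceiling is about 6.9 tok/s, so the 4.6–4.9 reported for llama.cpp/LM Studio is already a decent fraction of it, and the 2.5 from TRT-LLM points at overhead rather than the quant.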
I was able to get some speedups here, reaching 12 tok/s when using speculative decoding with TensorRT-LLM. I’ll keep digging a bit to see if we can push this any further.
Hey mate, anything new on this, or is it still the same? I thought it was meant to give us a bit faster inference speeds as well as help with KV-cache loading. I’m a newb though, big time LOL
Well, vLLM has gotten closer to llama.cpp speeds in inference, and you can improve speeds further when you have a cluster. Do you have any specific question in mind?
Is NVFP4 not fully optimized yet for us to get the fastest speeds? I was looking into some of the NIMs and ComfyUI stuff, and I’m looking to build some agents and agentic workflows locally with my Spark! Interested in what people have found is the best to use and utilize!! :)
Yes, NVFP4 is not fully optimized yet, but it’s getting there. As of now, the best quants are FP8 (if the model is not too large and the weights fit in memory), AWQ, and INT4 AutoRound.