TRT LLM for Inference with NVFP4 safetensors slower than LM Studio GGUF on the Spark

I was running the TRT LLM for Inference playbook with the nvidia/Llama-3.3-70B-Instruct-FP4 model, and loaded meta/llama-3.3-70b Q4_K_M in LM Studio for comparison. TRT-LLM uses almost 90 GB of memory compared to 43 GB in LM Studio. Tokens per second also differ significantly: 4.6-4.9 tok/s in LM Studio versus about 2.5 tok/s in TRT-LLM.
There is probably something wrong in the configuration. Could you help me figure out the difference?

To help us support you, can you give us more information on how you served the LLM and measured the performance? Can you share your scripts and commands?
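If you don't have a dedicated benchmark script, a minimal harness like the sketch below is enough to produce a comparable tokens-per-second number. It assumes the OpenAI-compatible endpoint that trtllm-serve exposes on the default port 8000; the base URL, model ID, and prompt are placeholders, not the playbook's exact values.

```python
# Hypothetical throughput check against a local trtllm-serve endpoint.
# Assumptions: default http://localhost:8000/v1, the 70B FP4 model ID from
# this thread, and a short fixed prompt. Adjust to match your setup.
import time

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

prompt = "Explain the difference between FP4 and Q4_K_M quantization."

start = time.perf_counter()
completion = client.chat.completions.create(
    model="nvidia/Llama-3.3-70B-Instruct-FP4",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=256,
    temperature=0.0,
)
elapsed = time.perf_counter() - start

# End-to-end tokens per second (includes prefill, so it slightly
# understates pure decode throughput for short prompts).
generated = completion.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f} s -> {generated / elapsed:.2f} tok/s")
```

Running the same harness against LM Studio's local server (it also exposes an OpenAI-compatible API) would make the two numbers directly comparable.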

I’m experiencing the same thing. The scripts and commands being used are from the tutorial here:

Any solution to this? It’s actually pretty slow.

Essentially, instead of using the 8B model, I swapped it out for the 70B.
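For reference, the swap amounts to nothing more than changing the model ID. A rough sketch using TensorRT-LLM's Python LLM API is below; the playbook itself drives trtllm-serve, so treat this as an approximation of the same change rather than the playbook's exact command.

```python
# Sketch of the model swap described above: the 8B model ID from the playbook
# replaced with the 70B FP4 checkpoint. Everything else is left at defaults;
# this is an assumption-laden illustration, not the playbook's actual script.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.3-70B-Instruct-FP4")  # was the 8B model originally

outputs = llm.generate(
    ["Summarize the plot of Hamlet in three sentences."],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```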