Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

Hi @dhiaulm - it looks like you have downloaded the wrong checkpoint file, as described here. Can you please try downloading the HF model checkpoint again by cloning the repository correctly?

Hi @deepikasv1703 - It looks like the GPU you’re running on is not supported. Can you confirm which GPU? Also, look at this solution if you are using a V100 or another GPU that doesn’t support the FMHA kernel.
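
If the linked solution boils down to disabling the FMHA kernel at engine-build time, the relevant flag in recent TensorRT-LLM releases looks roughly like the sketch below; the checkpoint and output paths are placeholders, and flag names can vary between releases, so check the trtllm-build help for your version.

# Build the engine with context FMHA disabled for GPUs (e.g., V100) that lack FMHA support
# (./llama3_checkpoint and ./llama3_engine are illustrative paths)
trtllm-build --checkpoint_dir ./llama3_checkpoint \
             --output_dir ./llama3_engine \
             --gemm_plugin float16 \
             --context_fmha disable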

Hi @anjshah - I tried what you suggested on a V100, but it is still not supported. Could you help me run the Docker runtime on Colab?

!docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04

This command is not working on Colab to get a GPU connection.

Hi @deepikasv1703 - did you follow these steps to get docker running in colab?

Hi - this error occurs when converting the checkpoint on a device with insufficient memory. Consider switching to a device with more memory.
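
As a quick sanity check before rerunning the conversion, you can look at free GPU memory and host RAM with standard tools (nothing specific to the blog):

# Report total/used/free GPU memory, then host memory
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
free -h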

Hi @rajat.jain - if this error still persists, try switching to v0.9.0 for both TensorRT-LLM and tensorrt-llm_backend and use this version of tritonserver.
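
As a rough sketch, pinning both repositories to the v0.9.0 tag could look like the commands below; the Triton container tag shown is an assumption, so verify it against the version linked above.

# Clone both repos at the v0.9.0 release tag
git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git

# Pull a Triton container with a matching TensorRT-LLM backend
# (the 24.04 tag is assumed to pair with v0.9.0; confirm before using)
docker pull nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3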

Hi, I’m trying to follow this blog, but I’m getting hung up at:

# Obtain and start the basic docker image environment.
docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04

When I copy and paste that command, I get the error:

docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.

Hi @richard.forman - it seems like you don’t have permission to run docker in the environment you’re trying to run in.

I was able to docker pull a container earlier, so I think I have permissions. I also tried the same command on a machine that I have built containers on before, and I got the same error.

See if you can resolve it using this suggestion.
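
The linked suggestion is not reproduced here, but the usual fix for the "unknown or invalid runtime name: nvidia" error is to install the NVIDIA Container Toolkit and register it as a Docker runtime; a sketch of the typical commands on Ubuntu, assuming the toolkit’s apt repository is already configured:

# Install the toolkit, register the nvidia runtime with Docker, and restart the daemon
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify that Docker now knows about the nvidia runtime
docker info | grep -i nvidia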

Hi AnjShah, Thank you! That suggestion worked for me.

It is mentioned in the Retrieving the Model Weights section that:

You can also download the weights to use offline with the following command and update the paths in later commands to point to this directory:

What are the later commands, and how do I update the paths in those commands to point to the directory that the model weights were cloned to?

Do I need to run the TensorRT-LLM container in a way that gives it access to the directory where the model weights are located?

Yes, once you git clone and download the model weights into the TensorRT-LLM directory (earlier step), you can provide the path to this directory as shown in the blog. It’s the --volume option to the docker command that mounts your current working directory (TensorRT-LLM), with the downloaded model weights, to a directory inside the container that you can access. Just follow the provided steps and let us know if you run into issues.
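
For illustration, a sketch of what this looks like end to end; the Hugging Face repo name below is the instruction-tuned Llama 3 8B variant used in the blog, so adjust it to whichever variant you downloaded.

# Download the weights into the TensorRT-LLM working directory
# (requires access to the gated Hugging Face repo)
cd TensorRT-LLM
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

# --volume mounts the current directory, including the weights, at /TensorRT-LLM inside the container
docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM \
  --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04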

I did all the steps, ran Triton with the converted Llama 3 model, and called the API with:

curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
  "text_input": "How to calculate 3 plus 4 ?",
  "parameters": {
    "max_tokens": 100,
    "bad_words": [""],
    "stop_words": [""]
  }
}'

But it can’t stop generating the response properly; it looks like some configuration is missing. In the response I see the following:

"text_output": " 3 + 4 = 7\nHow to calculate 5 plus 6? 5 + 6 = 11\nHow to calculate 7 plus 8? 7 + 8 = 15\nHow to calculate 9 plus 10? 9 + 10 = 19\nHow to calculate 11 plus 12? 11 + 12 = 23\nHow to calculate 13 plus 14? 13 + 14 = 27\nHow"

What should I do for proper model/request configuration?

Hi @prots.igor - can you try adding more stop words and experimenting more with the max_tokens parameter? Also, the tutorial uses the instruction-tuned variant, so feel free to add to the prompt that the output should be brief and succinct.

Hi @anjshah, where should I add more stop words: in the API request or in the model config? I’m not sure the max_tokens parameter will help, because in a production environment responses will probably be longer than 100 tokens.
It also looks like this example in the blog post is not finished, because you showed only the request to the model without the model’s answer.

It’s part of the curl request as described in the blog:

curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
  "text_input": "How do I count to nine in French?",
  "parameters": {
    "max_tokens": 100,
    "bad_words": [""],
    "stop_words": [""]
  }
}'

This is where you experiment with max_tokens and stop_words when you send a request to the server. Outputs are probabilistic, so they are not included in the blog. The blog post is complete.

We faced the same issue, but found the proper solution in this area for how to set and manipulate these parameters as well.

The following request body works for me:

{
  "text_input": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful assistant. Answer the last Human question in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>Where is Ukraine located?<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
  "parameters": {
    "max_tokens": 300,
    "bad_words": [""],
    "stop_words": ["<|eot_id|>"]
  }
}

When I added "stop_words": ["<|eot_id|>"], the model stopped looping during generation.
@anjshah, thanks for the help.
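
For reference, one way to send a longer JSON body like the one above is to save it to a file and point curl at it; request.json here is just an illustrative file name.

# Send the request body from a file to the same generate endpoint used earlier in the thread
curl -X POST localhost:8000/v2/models/ensemble/generate -d @request.json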

Hi @anjshah,
I am at the section where it states:

Next, we must modify the configuration files from the repository skeleton with the location of the compiled model engine. We must also update configuration parameters such as tokenizer to use and handle memory allocation for the KV cache when batching requests for inference.

However, I do not see the paths of the configuration files that need to be modified. What are the filepaths that need editing?
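
For reference (this is not a confirmed answer from the blog authors): in the tensorrtllm_backend repository, the files in question are typically the config.pbtxt files under all_models/inflight_batcher_llm/ (preprocessing, postprocessing, tensorrt_llm, ensemble, tensorrt_llm_bls), which are filled in with the tools/fill_template.py helper. A sketch, with paths and parameter names that may differ between versions:

# Assumed repository layout; adjust paths and values to your setup
cd tensorrtllm_backend
MODEL_DIR=all_models/inflight_batcher_llm

# Point the pre/post-processing models at the tokenizer
python3 tools/fill_template.py -i ${MODEL_DIR}/preprocessing/config.pbtxt \
  tokenizer_dir:/path/to/Meta-Llama-3-8B-Instruct,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i ${MODEL_DIR}/postprocessing/config.pbtxt \
  tokenizer_dir:/path/to/Meta-Llama-3-8B-Instruct,triton_max_batch_size:64,postprocessing_instance_count:1

# Point the tensorrt_llm model at the compiled engine and set the KV cache memory fraction
python3 tools/fill_template.py -i ${MODEL_DIR}/tensorrt_llm/config.pbtxt \
  engine_dir:/path/to/engines,triton_max_batch_size:64,decoupled_mode:false,batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.9

# The ensemble and tensorrt_llm_bls config.pbtxt files take similar triton_max_batch_size (and related) values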