Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

Hi @dhiaulm - it looks like you have downloaded the wrong checkpoint file, as described here. Can you please try downloading the HF model checkpoint again by cloning the repository correctly?

Hi @deepikasv1703 - It looks like the GPU you’re running on is not supported. Can you confirm which GPU? Also, look at this solution if you are using a V100 or another GPU that doesn’t support the FMHA kernel.
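
If the linked solution boils down to disabling the FMHA kernel at engine-build time, the relevant flag in recent TensorRT-LLM releases looks roughly like the sketch below; the checkpoint and output paths are placeholders, and flag names can vary between releases, so check the trtllm-build help for your version.

# Build the engine with context FMHA disabled for GPUs (e.g., V100) that lack FMHA support
# (./llama3_checkpoint and ./llama3_engine are illustrative paths)
trtllm-build --checkpoint_dir ./llama3_checkpoint \
             --output_dir ./llama3_engine \
             --gemm_plugin float16 \
             --context_fmha disable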

Hi @anjshah - I tried what you suggested on a V100, but it is still not supported. Could you help me run the Docker runtime on Colab?

!docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04

This command is not working on Colab to get a GPU connection.

Hi @deepikasv1703 - did you follow these steps to get docker running in colab?

Hi - this error occurs when converting the checkpoint on a device with insufficient memory. Consider switching to a device with more memory.
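
As a quick sanity check before rerunning the conversion, you can look at free GPU memory and host RAM with standard tools (nothing specific to the blog):

# Report total/used/free GPU memory, then host memory
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
free -h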

Hi @rajat.jain - if this error still persists, try switching to v0.9.0 for both TensorRT-LLM and tensorrt-llm_backend and use this version of tritonserver.
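
As a rough sketch, pinning both repositories to the v0.9.0 tag could look like the commands below; the Triton container tag shown is an assumption, so verify it against the version linked above.

# Clone both repos at the v0.9.0 release tag
git clone -b v0.9.0 https://github.com/NVIDIA/TensorRT-LLM.git
git clone -b v0.9.0 https://github.com/triton-inference-server/tensorrtllm_backend.git

# Pull a Triton container with a matching TensorRT-LLM backend
# (the 24.04 tag is assumed to pair with v0.9.0; confirm before using)
docker pull nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3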

Hi, I’m trying to follow this blog, but I’m getting hung up at:

# Obtain and start the basic docker image environment.
docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04

When I copy and paste that command, I get the error:

docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.

Hi @richard.forman - it seems like you don’t have permission to run docker in the environment you’re trying to run in.

I was able to docker pull a container earlier, so I think I have permissions. I also tried the same command on a machine that I have built containers on before, and I got the same error.

See if you can resolve it using this suggestion.
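
The linked suggestion is not reproduced here, but the usual fix for the "unknown or invalid runtime name: nvidia" error is to install the NVIDIA Container Toolkit and register it as a Docker runtime; a sketch of the typical commands on Ubuntu, assuming the toolkit’s apt repository is already configured:

# Install the toolkit, register the nvidia runtime with Docker, and restart the daemon
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify that Docker now knows about the nvidia runtime
docker info | grep -i nvidia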

Hi AnjShah, Thank you! That suggestion worked for me.

It is mentioned in the Retrieving the Model Weights section that:

You can also download the weights to use offline with the following command and update the paths in later commands to point to this directory:

What are the later commands, and how do I update the paths in those commands to point to the directory that the model weights were cloned to?

Do I need to run the TensorRT-LLM container in a way that gives it access to the directory where the model weights are located?

Yes, once you git clone and download the model weights into the TensorRT-LLM directory (earlier step), you can provide the path to this directory as shown in the blog. It’s the --volume option to the docker command that mounts your current working directory (TensorRT-LLM), with the downloaded model weights, to a directory inside the container that you can access. Just follow the provided steps and let us know if you run into issues.
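
For illustration, a sketch of what this looks like end to end; the Hugging Face repo name below is the instruction-tuned Llama 3 8B variant used in the blog, so adjust it to whichever variant you downloaded.

# Download the weights into the TensorRT-LLM working directory
# (requires access to the gated Hugging Face repo)
cd TensorRT-LLM
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

# --volume mounts the current directory, including the weights, at /TensorRT-LLM inside the container
docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM \
  --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04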

I did all the steps, ran Triton with the converted Llama 3 model, and called the API with:

curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
  "text_input": "How to calculate 3 plus 4 ?",
  "parameters": {
    "max_tokens": 100,
    "bad_words": [""],
    "stop_words": [""]
  }
}'

But it can’t stop generating the response properly; it looks like some configuration is missing. In the response I see the following:

"text_output": " 3 + 4 = 7\nHow to calculate 5 plus 6? 5 + 6 = 11\nHow to calculate 7 plus 8? 7 + 8 = 15\nHow to calculate 9 plus 10? 9 + 10 = 19\nHow to calculate 11 plus 12? 11 + 12 = 23\nHow to calculate 13 plus 14? 13 + 14 = 27\nHow"

What should I do for proper model/request configuration?

Hi @prots.igor - can you try adding more stop words and experimenting more with the max_tokens parameter? Also, the tutorial uses the instruction-tuned variant, so feel free to add to the prompt that the output should be brief and succinct.

Hi @anjshah, where should I add more stop words: in the API request or in the model config? I’m not sure the max_tokens parameter will help, because in a production environment responses will probably be longer than 100 tokens.
It also looks like this example in the blog post is not finished, because you showed only the request to the model without the model’s answer.

It’s part of the curl request as described in the blog:

curl -X POST localhost:8000/v2/models/ensemble/generate -d \
'{
  "text_input": "How do I count to nine in French?",
  "parameters": {
    "max_tokens": 100,
    "bad_words": [""],
    "stop_words": [""]
  }
}'

This is where you experiment with max_tokens and stop_words when you send a request to the server. Outputs are probabilistic, so they are not included in the blog. The blog post is complete.

We faced the same issue, but found the proper solution in this area for how to set and manipulate these parameters as well.

The following request body works for me:

{
  "text_input": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a helpful assistant. Answer the last Human question in JSON format.<|eot_id|><|start_header_id|>user<|end_header_id|>Where is Ukraine located?<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
  "parameters": {
    "max_tokens": 300,
    "bad_words": [""],
    "stop_words": ["<|eot_id|>"]
  }
}

When I added "stop_words": ["<|eot_id|>"], the model stopped looping during generation.
@anjshah, thanks for the help.
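
For reference, one way to send a longer JSON body like the one above is to save it to a file and point curl at it; request.json here is just an illustrative file name.

# Send the request body from a file to the same generate endpoint used earlier in the thread
curl -X POST localhost:8000/v2/models/ensemble/generate -d @request.json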

Hi @anjshah,
I am at the section where it states:

Next, we must modify the configuration files from the repository skeleton with the location of the compiled model engine. We must also update configuration parameters such as tokenizer to use and handle memory allocation for the KV cache when batching requests for inference.

However, I do not see the paths of the configuration files that need to be modified. What are the filepaths that need editing?
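
For reference (this is not a confirmed answer from the blog authors): in the tensorrtllm_backend repository, the files in question are typically the config.pbtxt files under all_models/inflight_batcher_llm/ (preprocessing, postprocessing, tensorrt_llm, ensemble, tensorrt_llm_bls), which are filled in with the tools/fill_template.py helper. A sketch, with paths and parameter names that may differ between versions:

# Assumed repository layout; adjust paths and values to your setup
cd tensorrtllm_backend
MODEL_DIR=all_models/inflight_batcher_llm

# Point the pre/post-processing models at the tokenizer
python3 tools/fill_template.py -i ${MODEL_DIR}/preprocessing/config.pbtxt \
  tokenizer_dir:/path/to/Meta-Llama-3-8B-Instruct,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i ${MODEL_DIR}/postprocessing/config.pbtxt \
  tokenizer_dir:/path/to/Meta-Llama-3-8B-Instruct,triton_max_batch_size:64,postprocessing_instance_count:1

# Point the tensorrt_llm model at the compiled engine and set the KV cache memory fraction
python3 tools/fill_template.py -i ${MODEL_DIR}/tensorrt_llm/config.pbtxt \
  engine_dir:/path/to/engines,triton_max_batch_size:64,decoupled_mode:false,batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.9

# The ensemble and tensorrt_llm_bls config.pbtxt files take similar triton_max_batch_size (and related) values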