Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available

Originally published at: https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/

Today, NVIDIA announces the public release of TensorRT-LLM to accelerate and optimize inference performance for the latest LLMs on NVIDIA GPUs. This open-source library is now available for free on the /NVIDIA/TensorRT-LLM GitHub repo and as part of the NVIDIA NeMo framework. Large language models (LLMs) have revolutionized the field of artificial intelligence and created…

We’re really excited to be releasing TensorRT-LLM to everyone, and we hope you all find it a valuable tool for accelerating and deploying your LLMs. If you have any questions or comments, let us know.

Could you please provide steps for building TensorRT-LLM from source, instead of a Docker build?
Thanks a lot.

Hi @yu.cai – you can find some information on the process in the documentation here. There isn't much difference between the Docker build process and building from source; the main difference is that the Docker method makes it simpler to ensure you have the right dependencies.

It might also be useful to take a look at the Dockerfile itself. The key thing is to ensure that you have the right versions of the TensorRT base image, PyTorch, and Polygraphy. Hope this helps!
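For reference, a build from source roughly follows the same steps the Dockerfile runs. A minimal sketch (paths, file names, and the TensorRT install location are illustrative; check the repo's documentation and Dockerfile for the exact versions and commands):

```shell
# Hedged sketch of a source build, assuming dependencies (CUDA, TensorRT,
# PyTorch) are already installed on the host.
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive

# Install the Python requirements listed in the repo
# (the exact requirements file name may differ; see the repo).
pip install -r requirements.txt

# Build the C++ runtime and the Python wheel, pointing at the local
# TensorRT installation (path is a placeholder).
python scripts/build_wheel.py --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl
```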

Hi Neal, thanks for your reply.
I understand the purpose of using Docker, but wouldn't a conda env also ensure the right third-party dependencies?

That should work as well, but some of the dependencies are not available as conda packages. For example, TensorRT-LLM requires a version of TensorRT that is currently only available as a download from nvidia.com (see this script for details), so you would still need to download some packages manually.
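To illustrate, a conda-based setup would still involve fetching the TensorRT tarball by hand. A rough sketch (the tarball name, versions, and install path are placeholders, not exact values):

```shell
# Hedged sketch: conda env plus a manually downloaded TensorRT tarball.
conda create -n trt-llm python=3.10 -y
conda activate trt-llm

# TensorRT itself is not on conda; download the tarball from
# developer.nvidia.com, then unpack it and point the build at it.
tar -xzf TensorRT-9.x.Linux.x86_64-gnu.cuda-12.x.tar.gz -C /opt
export TRT_ROOT=/opt/TensorRT-9.x

# The remaining Python dependencies can then be installed with pip.
pip install -r requirements.txt
```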

How can we use the top_p, top_k, and temperature parameters when deploying a TensorRT-LLM model with the tensorrtllm_backend of Triton Inference Server?
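For context, these sampling controls are typically passed as inputs on each inference request rather than baked into the engine. A hedged sketch of such a request body (the input names `text_input`, `max_tokens`, `top_k`, `top_p`, and `temperature`, and the `ensemble` model name, should be checked against the model's config.pbtxt in your tensorrtllm_backend model repository):

```python
import json

# Hedged sketch: per-request sampling parameters for the Triton
# ensemble model served by tensorrtllm_backend. Field names are
# assumptions to be verified against the deployed config.pbtxt.
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
    "top_k": 40,
    "top_p": 0.9,
    "temperature": 0.7,
}

# Serialize the request body; it could then be POSTed to Triton's HTTP
# endpoint, e.g.:
#   curl -X POST localhost:8000/v2/models/ensemble/generate -d @body.json
body = json.dumps(payload)
```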

I followed all the steps exactly, and everything worked until the last step.

When I ran the last command:
python /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 1

I got this error:

Internal: unexpected error when creating modelInstanceState: maxTokensInPagedKvCache must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width * tokensPerBlock * maxBlocksPerSeq);
I1128 16:31:49.479354 227 server.cc:592]
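The message indicates that the paged KV-cache budget configured for the backend is too small to hold even one complete sequence. A hedged sketch of the parameter that usually controls this budget, in the tensorrt_llm model's config.pbtxt (the value 8192 is purely illustrative, and the parameter name should be verified against the tensorrtllm_backend documentation):

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "8192"
  }
}
```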

When compiling the model (calling the build function), do I have to add any special arguments if I will deploy it on multiple GPUs?

Reading this README, it says I can add the parameters world_size, tp_size, pp_size, and parallel_build. But are those for compilation only, or do any of these numbers have to match world_size at the end of your tutorial?
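For illustration, the general pattern is that the parallelism the engine is built with has to match the world size used at launch. A hedged sketch using the flags the README cites (the model directory, output path, and the exact build script vary per model; these values are placeholders):

```shell
# Hedged sketch: build engines with tensor parallelism across 2 GPUs
# (flags per the README cited above; paths are illustrative).
python build.py --model_dir ./model-checkpoint \
                --world_size 2 --tp_size 2 \
                --parallel_build \
                --output_dir ./engines/2-gpu

# The --world_size passed to the launch script should match the value
# the engines were built with.
python /opt/scripts/launch_triton_server.py \
    --model_repo /all_models/inflight_batcher_llm --world_size 2
```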