Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available

Originally published at: https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/

Today, NVIDIA announces the public release of TensorRT-LLM to accelerate and optimize inference performance for the latest LLMs on NVIDIA GPUs. This open-source library is now available for free on the /NVIDIA/TensorRT-LLM GitHub repo and as part of the NVIDIA NeMo framework. Large language models (LLMs) have revolutionized the field of artificial intelligence and created…

We’re really excited to be releasing TensorRT-LLM to everyone, and we hope you all find it a valuable tool for accelerating and deploying your LLMs. If you have any questions or comments, let us know.

Could you please provide steps for building TensorRT-LLM from source, instead of a Docker build?
Thanks a lot.

Hi @yu.cai – you can find some information on the process in the documentation here. There isn't much difference between the Docker build process and building from source; the main difference is that the Docker method makes it simpler to ensure you have the right dependencies.

It might also be useful to take a look at the Dockerfile itself. The key thing is to ensure that you have the right versions of the TensorRT base image, PyTorch, and Polygraphy. Hope this helps!
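For reference, a build from source roughly follows the same steps the Dockerfile runs. A minimal sketch (paths, file names, and the TensorRT install location are illustrative; check the repo's documentation and Dockerfile for the exact versions and commands):

```shell
# Hedged sketch of a source build, assuming dependencies (CUDA, TensorRT,
# PyTorch) are already installed on the host.
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive

# Install the Python requirements listed in the repo
# (the exact requirements file name may differ; see the repo).
pip install -r requirements.txt

# Build the C++ runtime and the Python wheel, pointing at the local
# TensorRT installation (path is a placeholder).
python scripts/build_wheel.py --trt_root /usr/local/tensorrt
pip install ./build/tensorrt_llm*.whl
```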

Hi Neal, thanks for your reply.
I understand the purpose of using Docker, but wouldn't a conda env also ensure the right third-party dependencies?

That should work as well, but some of the dependencies are not available as conda packages. For example, TensorRT-LLM requires a version of TensorRT that is currently only available as a download from nvidia.com (see this script for details), so you would still need to download some packages manually.
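To illustrate, a conda-based setup would still involve fetching the TensorRT tarball by hand. A rough sketch (the tarball name, versions, and install path are placeholders, not exact values):

```shell
# Hedged sketch: conda env plus a manually downloaded TensorRT tarball.
conda create -n trt-llm python=3.10 -y
conda activate trt-llm

# TensorRT itself is not on conda; download the tarball from
# developer.nvidia.com, then unpack it and point the build at it.
tar -xzf TensorRT-9.x.Linux.x86_64-gnu.cuda-12.x.tar.gz -C /opt
export TRT_ROOT=/opt/TensorRT-9.x

# The remaining Python dependencies can then be installed with pip.
pip install -r requirements.txt
```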

How can we use the top_p, top_k, and temperature parameters when deploying a TensorRT-LLM model with the tensorrtllm_backend of Triton Inference Server?
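For context, these sampling controls are typically passed as inputs on each inference request rather than baked into the engine. A hedged sketch of such a request body (the input names `text_input`, `max_tokens`, `top_k`, `top_p`, and `temperature`, and the `ensemble` model name, should be checked against the model's config.pbtxt in your tensorrtllm_backend model repository):

```python
import json

# Hedged sketch: per-request sampling parameters for the Triton
# ensemble model served by tensorrtllm_backend. Field names are
# assumptions to be verified against the deployed config.pbtxt.
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 64,
    "top_k": 40,
    "top_p": 0.9,
    "temperature": 0.7,
}

# Serialize the request body; it could then be POSTed to Triton's HTTP
# endpoint, e.g.:
#   curl -X POST localhost:8000/v2/models/ensemble/generate -d @body.json
body = json.dumps(payload)
```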

I followed all the steps exactly, and everything worked until the last step.

When I ran the last command:
python /opt/scripts/launch_triton_server.py --model_repo /all_models/inflight_batcher_llm --world_size 1

I got this error:

Internal: unexpected error when creating modelInstanceState: maxTokensInPagedKvCache must be large enough to process at least 1 sequence to completion (i.e. must be larger than beam_width * tokensPerBlock * maxBlocksPerSeq);
I1128 16:31:49.479354 227 server.cc:592]
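The message indicates that the paged KV-cache budget configured for the backend is too small to hold even one complete sequence. A hedged sketch of the parameter that usually controls this budget, in the tensorrt_llm model's config.pbtxt (the value 8192 is purely illustrative, and the parameter name should be verified against the tensorrtllm_backend documentation):

```
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "8192"
  }
}
```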

When compiling the model (calling the build function), do I have to add any special arguments if I will deploy it on multiple GPUs?

Reading this README, it says I can add the parameters world_size, tp_size, pp_size, and parallel_build. But are those for compilation only, or do any of these numbers have to match world_size at the end of your tutorial?
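For illustration, the general pattern is that the parallelism the engine is built with has to match the world size used at launch. A hedged sketch using the flags the README cites (the model directory, output path, and the exact build script vary per model; these values are placeholders):

```shell
# Hedged sketch: build engines with tensor parallelism across 2 GPUs
# (flags per the README cited above; paths are illustrative).
python build.py --model_dir ./model-checkpoint \
                --world_size 2 --tp_size 2 \
                --parallel_build \
                --output_dir ./engines/2-gpu

# The --world_size passed to the launch script should match the value
# the engines were built with.
python /opt/scripts/launch_triton_server.py \
    --model_repo /all_models/inflight_batcher_llm --world_size 2
```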