Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

Originally published at: https://developer.nvidia.com/blog/turbocharging-meta-llama-3-performance-with-nvidia-tensorrt-llm-and-nvidia-triton-inference-server/

We’re excited to announce support for the Meta Llama 3 family of models in NVIDIA TensorRT-LLM, accelerating and optimizing your LLM inference performance. You can immediately try Llama 3 8B and Llama 3 70B—the first models in the series—through a browser user interface. Or, through API endpoints running on a fully accelerated NVIDIA stack from…

Hi
I'm following this blog post to build the TensorRT engine for Llama 3 8B.

I got the following error (NotImplementedError: Cannot copy out of meta tensor; no data!) when running the model conversion script. Has anyone encountered the same issue?

python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_1gpu_bf16 \
            --dtype bfloat16
[TensorRT-LLM] TensorRT-LLM version: 0.8.00.8.0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [03:24<00:00, 51.13s/it]
[05/01/2024-15:55:41] Some parameters are on the meta device device because they were offloaded to the cpu.
Traceback (most recent call last):
  File "/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1532, in <module>
    main()
  File "/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1508, in main
    covert_and_save(rank, convert_args)
  File "/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1480, in covert_and_save
    weights = convert_hf_llama(
  File "/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 820, in convert_hf_llama
    q_weight = get_weight(model_params, prefix + 'self_attn.q_proj', dtype)
  File "/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 630, in get_weight
    return config[prefix + '.weight'].detach().cpu()
NotImplementedError: Cannot copy out of meta tensor; no data!

Hi Ryan - please make sure you've downloaded the HF checkpoint of Llama 3 and that your --model_dir parameter points to the location of that checkpoint.
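For example, a minimal sketch of pulling the checkpoint with the Hugging Face CLI (this assumes you have accepted the license for the gated meta-llama repo; the huggingface-cli download step is an alternative to the git clone shown in the blog):

# Download the HF checkpoint of Llama 3 8B Instruct to a local directory
pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir ./Meta-Llama-3-8B-Instruct

# Then point --model_dir at that directory
python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_1gpu_bf16 \
            --dtype bfloat16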

Hello,

I followed the instructions and am running this on Ubuntu 22.04.4 with A100 cards. All the commands returned fine until I ran:

python3 examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/bf16/1-gpu --max_output_len 100 --tokenizer_dir ./Meta-Llama-3-8B-Instruct --input_text "How do I count to nine in French?"

and it errors out with the following:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be array, but is null
[TensorRT-LLM][WARNING] Optional value for parameter lora_target_modules will not be set.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key ‘max_draft_len’ not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Loaded engine size: 15320 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 15494, GPU 15761 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 15495, GPU 15771 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +15316, now: CPU 0, GPU 15316 (MiB)
[TensorRT-LLM] TensorRT-LLM version: 0.8.0Traceback (most recent call last):
  File "/TensorRT-LLM/examples/run.py", line 504, in <module>
    main(args)
  File "/TensorRT-LLM/examples/run.py", line 379, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 169, in from_dir
    session = GptSession(config=session_config,

nvidia-smi in the docker container does show the GPUs and the other commands do use one of the GPUs during the conversion process. Any ideas on what I’m missing?

These look like warning/info messages. What's the exact error message you see?

I changed the docker command from --gpus all to --gpus 1, and that seemed to fix it.
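For reference, the adjusted Docker command would look roughly like this (same image and mounts as in the blog post, with only the --gpus value changed):

docker run --rm --runtime=nvidia --gpus 1 --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04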


Hi Ryan,

I’ve also encountered a similar problem. I solved it by adding the --load_model_on_cpu option to the command.

This workaround is suggested in a post on this GitHub issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1440
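For reference, a sketch of the conversion command with that flag added (same paths as in the blog post):

python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_1gpu_bf16 \
            --dtype bfloat16 \
            --load_model_on_cpu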

Hi @anjshah, do you have any benchmarking results? Thank you.

Hi @rbhagwat1 - we support many models and have published benchmark results for some of them here. Please follow these steps to generate additional benchmarking results as per your requirements.

Do you have examples of running tritonserver with tp_size > 1?

Hi @msgersch2 - yes, you would need to build the TRT-LLM engine with tp_size > 1 (for example 2, 4, etc.), and then set the --world_size config parameter for tritonserver to match the tp_size you used, as shown here.
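For illustration, here is a rough sketch of the flow. The output paths and the launch_triton_server.py invocation (from the tensorrtllm_backend repo) are assumptions rather than commands taken from this blog post:

# Convert and build a 2-way tensor-parallel engine
python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_2gpu_bf16 \
            --dtype bfloat16 \
            --tp_size 2

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_bf16 \
            --output_dir ./tmp/llama/8B/trt_engines/bf16/2-gpu \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16

# Launch Triton with a world size that matches the tp_size used above
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/path/to/triton_model_repo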

python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_1gpu_bf16 \
            --dtype bfloat16

I'm getting the below error when running the above command:

[TensorRT-LLM] TensorRT-LLM version: 0.8.0Traceback (most recent call last):
  File "/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 10, in <module>
    from tensorrt_llm._utils import release_gc
ImportError: cannot import name 'release_gc' from 'tensorrt_llm._utils' (/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py)

Hi @rajat.jain - can you share all the steps you followed, and whether there were previous versions of the library installed?

@anjshah I have followed the exact steps listed in the blog here. Also this was a fresh installation, the library wasn’t installed before.

Getting this same error with tensorrt_llm version 0.8.0 and the repo checked out at v0.8.0.

Hi @rajat.jain - it doesn't look like the library is properly installed. Also, the instructions differ based on the underlying OS (Windows or Linux). If the issue persists after you uninstall and reinstall, please create an issue here for our support team to look into.
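For example, something along these lines inside the Linux container (a sketch using the same pip index as the blog post; the version check at the end is just a sanity test):

pip3 uninstall -y tensorrt_llm
pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com
# Sanity check: should print 0.8.0
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"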

I found the tutorial here:

Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server | NVIDIA Technical Blog

and am running TensorRT-LLM using Docker, installing dependencies with these commands:

# Obtain and start the basic docker image environment.
docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04

# Install dependencies, TensorRT-LLM requires Python 3.10
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

# Install the stable version (corresponding to the cloned branch) of TensorRT-LLM.
pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com

After that, I convert the Llama 3 8B model using this command:

python3 examples/llama/convert_checkpoint.py --model_dir ./Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_1gpu_bf16 \
            --dtype bfloat16

I got this error:
[TensorRT-LLM] TensorRT-LLM version: 0.8.00.8.0
Loading checkpoint shards: 25%|██████████████████████████████████████████▎ | 1/4 [00:01<00:04, 1.57s/it]Traceback (most recent call last):
  File "/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1532, in <module>
    main()
  File "/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 1420, in main
    model = AutoModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3694, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4079, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 503, in load_state_dict
    with safe_open(checkpoint_file, framework="pt") as f:
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

Can anyone help?

Hi, can anyone help with this error?

I'm facing an issue in a Colab notebook where the model is not converting to an engine. (Steps involved below.)

!git clone -b v0.8.0 https://github.com/NVIDIA/TensorRT-LLM.git
!cd TensorRT-LLM

#Download Hugging Face module
!pip install huggingface-hub

#Authenticate token
!huggingface-cli login

#run the credential helper
!git config --global credential.helper store

#Rerun the Authenticate token
!huggingface-cli login

#Initialize Git LFS and clone the HF repo to Colab
!git lfs install
#Clone the Llama 3 model
!git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

Obtain and start the basic Docker image environment. (The Docker command below is not working on Colab.)

!docker run --rm --runtime=nvidia --gpus all --volume ${PWD}:/TensorRT-LLM --entrypoint /bin/bash -it --workdir /TensorRT-LLM nvidia/cuda:12.1.0-devel-ubuntu22.04

Install dependencies, TensorRT-LLM requires Python 3.10

!apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev

Install the stable version (corresponding to the cloned branch) of TensorRT-LLM.

!pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com

Build the Llama 8B model using a single GPU and BF16.

!python3 /content/TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir /content/Meta-Llama-3-8B-Instruct \
            --output_dir ./tllm_checkpoint_1gpu_bf16 \
            --load_model_on_cpu \
            --dtype bfloat16

!trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
            --output_dir ./tmp/llama/8B/trt_engines/bf16/1-gpu \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16

After running the "Build the Llama 8B model using a single GPU and BF16" step, the error below is shown:

[TensorRT-LLM] TensorRT-LLM version: 0.8.00.8.0
Loading checkpoint shards: 100% 4/4 [00:01<00:00, 3.25it/s]
Weights loaded. Total time: 00:00:51
Total time of converting checkpoints: 00:02:31
[TensorRT-LLM] TensorRT-LLM version: 0.8.0[05/15/2024-20:05:17] [TRT-LLM] [I] Set bert_attention_plugin to float16.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set gpt_attention_plugin to bfloat16.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set gemm_plugin to bfloat16.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set lookup_plugin to None.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set lora_plugin to None.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set context_fmha to True.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set context_fmha_fp32_acc to False.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set paged_kv_cache to True.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set remove_input_padding to True.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set multi_block_mode to False.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set enable_xqa to True.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set attention_qk_half_accumulation to False.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set tokens_per_block to 128.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[05/15/2024-20:05:17] [TRT-LLM] [I] Set use_context_fmha_for_generation to False.
[05/15/2024-20:05:17] [TRT-LLM] [W] remove_input_padding is enabled, while max_num_tokens is not set, setting to max_batch_size*max_input_len.
It may not be optimal to set max_num_tokens=max_batch_size*max_input_len when remove_input_padding is enabled, because the number of packed input tokens are very likely to be smaller, we strongly recommend to set max_num_tokens according to your workloads.
[05/15/2024-20:05:17] [TRT] [I] [MemUsageChange] Init CUDA: CPU +13, GPU +0, now: CPU 260, GPU 103 (MiB)
[05/15/2024-20:05:24] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +978, GPU +180, now: CPU 1374, GPU 283 (MiB)
[05/15/2024-20:05:24] [TRT-LLM] [I] Set nccl_plugin to None.
[05/15/2024-20:05:24] [TRT-LLM] [I] Set use_custom_all_reduce to True.
[05/15/2024-20:05:24] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/vocab_embedding/GATHER_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_0_output_0: first input has type BFloat16 but second input has type Float.
[05/15/2024-20:05:24] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/0/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/0/input_layernorm/SHUFFLE_1_output_0: first input has type BFloat16 but second input has type Float.
[TensorRT-LLM][WARNING] Fall back to unfused MHA because of unsupported head size 128 in sm_{75}.
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: Unsupported data type, pre SM 80 GPUs do not support bfloat16 (/home/jenkins/agent/workspace/LLM/release-0.8/L0_PostMerge/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:444)
1 0x7e2ab62b0803 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x38803) [0x7e2ab62b0803]
2 0x7e2ab62b0a9e /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x38a9e) [0x7e2ab62b0a9e]
3 0x7e2ab62d80eb tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(int, int, int, int, float, tensorrt_llm::kernels::PositionEmbeddingType, int, float, tensorrt_llm::kernels::RotaryScalingType, float, int, int, int, bool, tensorrt_llm::kernels::ContextFMHAType, bool, bool, int, bool, tensorrt_llm::kernels::AttentionMaskType, bool, int, nvinfer1::DataType, int, bool, bool, int, bool, bool, bool, bool, bool) + 219
4 0x7e2ab62d8b84 tensorrt_llm::plugins::GPTAttentionPluginCreator::createPlugin(char const*, nvinfer1::PluginFieldCollection const*) + 2644
5 0x7e2b4496030a /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x16030a) [0x7e2b4496030a]
6 0x7e2b44843433 /usr/local/lib/python3.10/dist-packages/tensorrt_bindings/tensorrt.so(+0x43433) [0x7e2b44843433]
7 0x5699b825810e /usr/bin/python3(+0x15a10e) [0x5699b825810e]
8 0x5699b824ea7b _PyObject_MakeTpCall + 603
9 0x5699b8266acb /usr/bin/python3(+0x168acb) [0x5699b8266acb]
10 0x5699b8246cfa _PyEval_EvalFrameDefault + 24906
11 0x5699b82589fc _PyFunction_Vectorcall + 124
12 0x5699b8267492 PyObject_Call + 290
13 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
14 0x5699b82589fc _PyFunction_Vectorcall + 124
15 0x5699b8267492 PyObject_Call + 290
16 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
17 0x5699b82667f1 /usr/bin/python3(+0x1687f1) [0x5699b82667f1]
18 0x5699b8267492 PyObject_Call + 290
19 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
20 0x5699b82589fc _PyFunction_Vectorcall + 124
21 0x5699b824dcbd _PyObject_FastCallDictTstate + 365
22 0x5699b826386c _PyObject_Call_Prepend + 92
23 0x5699b837e700 /usr/bin/python3(+0x280700) [0x5699b837e700]
24 0x5699b824ea7b _PyObject_MakeTpCall + 603
25 0x5699b8248150 _PyEval_EvalFrameDefault + 30112
26 0x5699b82667f1 /usr/bin/python3(+0x1687f1) [0x5699b82667f1]
27 0x5699b8267492 PyObject_Call + 290
28 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
29 0x5699b82589fc _PyFunction_Vectorcall + 124
30 0x5699b824dcbd _PyObject_FastCallDictTstate + 365
31 0x5699b826386c _PyObject_Call_Prepend + 92
32 0x5699b837e700 /usr/bin/python3(+0x280700) [0x5699b837e700]
33 0x5699b826742b PyObject_Call + 187
34 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
35 0x5699b82667f1 /usr/bin/python3(+0x1687f1) [0x5699b82667f1]
36 0x5699b824253c _PyEval_EvalFrameDefault + 6540
37 0x5699b82667f1 /usr/bin/python3(+0x1687f1) [0x5699b82667f1]
38 0x5699b8267492 PyObject_Call + 290
39 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
40 0x5699b82667f1 /usr/bin/python3(+0x1687f1) [0x5699b82667f1]
41 0x5699b8267492 PyObject_Call + 290
42 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
43 0x5699b82589fc _PyFunction_Vectorcall + 124
44 0x5699b824dcbd _PyObject_FastCallDictTstate + 365
45 0x5699b826386c _PyObject_Call_Prepend + 92
46 0x5699b837e700 /usr/bin/python3(+0x280700) [0x5699b837e700]
47 0x5699b826742b PyObject_Call + 187
48 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
49 0x5699b82589fc _PyFunction_Vectorcall + 124
50 0x5699b824126d _PyEval_EvalFrameDefault + 1725
51 0x5699b82589fc _PyFunction_Vectorcall + 124
52 0x5699b8267492 PyObject_Call + 290
53 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
54 0x5699b82589fc _PyFunction_Vectorcall + 124
55 0x5699b8267492 PyObject_Call + 290
56 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
57 0x5699b82589fc _PyFunction_Vectorcall + 124
58 0x5699b8267492 PyObject_Call + 290
59 0x5699b82435d7 _PyEval_EvalFrameDefault + 10791
60 0x5699b82589fc _PyFunction_Vectorcall + 124
61 0x5699b824126d _PyEval_EvalFrameDefault + 1725
62 0x5699b823d9c6 /usr/bin/python3(+0x13f9c6) [0x5699b823d9c6]
63 0x5699b8333256 PyEval_EvalCode + 134
64 0x5699b835e108 /usr/bin/python3(+0x260108) [0x5699b835e108]
65 0x5699b83579cb /usr/bin/python3(+0x2599cb) [0x5699b83579cb]
66 0x5699b835de55 /usr/bin/python3(+0x25fe55) [0x5699b835de55]
67 0x5699b835d338 _PyRun_SimpleFileObject + 424
68 0x5699b835cf83 _PyRun_AnyFileObject + 67
69 0x5699b834fa5e Py_RunMain + 702
70 0x5699b832602d Py_BytesMain + 45
71 0x7e2d15d3bd90 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7e2d15d3bd90]
72 0x7e2d15d3be40 __libc_start_main + 128
73 0x5699b8325f25 _start + 37
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 497, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 420, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 392, in build_and_save
    engine = build(build_config,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 282, in build
    return build_model(model, build_config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 198, in build_model
    model(**inputs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 498, in forward
    hidden_states = self.transformer.forward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 202, in forward
    hidden_states = self.layers.forward(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 255, in forward
    hidden_states = layer(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 116, in forward
    attention_output = self.attention(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 40, in __call__
    output = self.forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/attention.py", line 742, in forward
    context, past_key_value = gpt_attention(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/graph_rewriting.py", line 561, in wrapper
    outs = f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 3813, in gpt_attention
    layer = default_trtnet().add_plugin_v2(plug_inputs, attn_plug)
TypeError: add_plugin_v2(): incompatible function arguments. The following argument types are supported:
    1. (self: tensorrt_bindings.tensorrt.INetworkDefinition, inputs: List[tensorrt_bindings.tensorrt.ITensor], plugin: tensorrt_bindings.tensorrt.IPluginV2) -> tensorrt_bindings.tensorrt.IPluginV2Layer

Invoked with: <tensorrt_bindings.tensorrt.INetworkDefinition object at 0x7e2b2d1ead70>, [<tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4fcc7f0>, <tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4f9cb70>, <tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4f9d0b0>, <tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4f9d870>, <tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4fa26b0>, <tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4f9d370>, <tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4fa28b0>, <tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4f9ce30>, <tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4f569f0>, <tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4f56cb0>, <tensorrt_bindings.tensorrt.ITensor object at 0x7e2ab4f9d5b0>], None

Hi @anjshah, could you help here? The steps and error are the same as in my post above.
