Turbocharging Meta Llama 3 Performance with NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server

Hi @richard.forman - The config files to modify are in the ‘all_models/inflight_batcher_llm’ directory, in the preprocessing, postprocessing, tensorrt_llm_bls, ensemble, and tensorrt_llm subfolders. The fill_template.py script modifies these config files for you. These are the following lines of code, as described in the blog post:

#Set the tokenizer_dir and engine_dir paths
HF_LLAMA_MODEL=TensorRT-LLM/Meta-Llama-3-8B-Instruct
ENGINE_PATH=tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},tokenizer_type:auto,triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0
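After these substitutions, the next step is to launch the server against that model repo. A minimal sketch, assuming the launch script shipped with tensorrtllm_backend and a single-GPU (world_size 1) Llama 3 8B engine:

# Launch Triton Inference Server with the filled-in model repo (adjust --world_size to match your tensor-parallel size)
python3 scripts/launch_triton_server.py --world_size 1 --model_repo=all_models/inflight_batcher_llm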

Hi @anjshah,

Thanks for clearing that up for me. I didn’t realise that those lines were meant to set environment variables. I thought they were variable definitions that needed to be added to some config file.

Hi,

Do you have any results showing whether the Llama 3 performance reported on evaluation benchmarks is reproduced, at least to some degree, when Llama 3 is deployed with TensorRT-LLM?

Hi @anjshah,
Thank you for all of the help. I was able to successfully send the API call and got the proper response.

I do have some performance-related questions though. I have this currently deployed on an AWS EC2 g5.8xlarge instance. What kind of performance can I expect? How many clients can send requests at a time? For heavy traffic, will we need to set this up as an autoscaling group behind a load balancer?

Hi @richard.forman - The Triton Inference Server backend is already set up to use TensorRT-LLM’s batch manager to handle multiple concurrent requests without affecting latency. See an example of sending multiple simultaneous requests to the deployed model here.
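For a quick local sanity check of concurrency, something like the sketch below fires several requests at the ensemble model’s generate endpoint in parallel. It assumes Triton’s default HTTP port 8000 and the input names used by the ensemble in this setup (text_input, max_tokens, bad_words, stop_words):

# Send 8 concurrent requests to the deployed ensemble (assumes the HTTP endpoint on localhost:8000)
for i in $(seq 1 8); do
  curl -s -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}' &
done
wait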

Hi @matatov.n - we’ve published some perf numbers here, and we’ll soon be adding Llama 3 and other newly added models’ perf numbers to this overview page. Meanwhile, if you would like to take a stab at it, please follow these instructions to reproduce the benchmark results using the TRT engine you built for Llama 3.

Thank you for your answer. I am also interested in benchmarking the accuracy of models compiled with TensorRT-LLM on the MMLU task, for example, or on other popular evaluation tasks.

Hi everyone,

I have a question: are token streaming, paged attention, and the KV cache available just after compilation with TensorRT-LLM, or do they require the Triton Inference Server? What if I just deploy the compiled model to GKE (after dockerizing it)?
Thanks

Hi @matatov.n - yes, we have an MMLU script (mmlu.py) that you can use for this purpose.
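As a rough illustration, this is how that evaluation might be invoked against the Llama 3 engine built earlier; the flag names here are assumptions and may differ between TensorRT-LLM versions, so check the script’s --help (the MMLU data set also has to be downloaded separately):

# Hypothetical mmlu.py invocation against the TRT-LLM engine (flag names are assumptions; verify with --help)
python3 examples/mmlu.py --test_trt_llm \
    --engine_dir ${ENGINE_PATH} \
    --hf_model_dir ${HF_LLAMA_MODEL}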

No, you don’t need the Triton Inference Server to use these TensorRT-LLM features. You can follow this example with the run script.
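A minimal sketch of running the compiled engine directly with the example run script, with no Triton component involved; the paths reuse the variables from earlier in this thread and are only illustrative:

# Run the compiled engine with the TensorRT-LLM example runner; --streaming emits tokens as they are generated
python3 examples/run.py \
    --engine_dir ${ENGINE_PATH} \
    --tokenizer_dir ${HF_LLAMA_MODEL} \
    --input_text "What is machine learning?" \
    --max_output_len 64 \
    --streaming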

Hi @anjshah,

Do you have some ready results showing that using TensorRT-LLM maintains the original (i.e. Llama 3) results?
I am afraid running the evaluations ourselves could be very expensive for us.

Yes, we do, and we plan to post them early next week.

Hi,

I am searching for a Dockerfile that, given the rank0.engine and config.json files, would expose an inference server. Any link for reference?

Thanks

Hi @anjshah,

Do you have a post with the evaluations?

Hi @matatov.n - if you mean the latest Llama 3 perf evaluations, yes, we’ve updated those on the perf-overview page of the GitHub docs. If you meant the MMLU evaluations, we have provided the mmlu.py script, which you can use as shown here.

@anjshah I have another question, about the gRPC API. I used your example and config for building a Docker image with Triton and Llama 3, and the /generate endpoint works well. But now I need to switch to gRPC with streaming support. So I created a Go gRPC client and inference request:

	request := proto.ModelInferRequest{
		Id:           uuid.New().String(),
		ModelName:    modelName,
		ModelVersion: "1",
		Inputs: []*proto.ModelInferRequest_InferInputTensor{
			{
				Name:     "text_input",
				Shape:    []int64{-1, -1},
				Datatype: "BYTES",
				Contents: &proto.InferTensorContents{
					BytesContents: [][]byte{[]byte(text)},
				},
			},
			{
				Name:     "max_tokens",
				Shape:    []int64{-1, -1},
				Datatype: "INT32",
				Contents: &proto.InferTensorContents{
					IntContents: []int32{maxTokens},
				},
			},
			{
				Name:     "stop_words",
				Shape:    []int64{-1, -1},
				Datatype: "BYTES",
				Contents: &proto.InferTensorContents{
					BytesContents: [][]byte{[]byte(stopWord)},
				},
			},
			{
				Name:     "bad_words",
				Shape:    []int64{-1, -1},
				Datatype: "BYTES",
				Contents: &proto.InferTensorContents{
					BytesContents: [][]byte{[]byte("")},
				},
			},
		},

		Outputs: []*proto.ModelInferRequest_InferRequestedOutputTensor{
			{
				Name: "text_output",
			},
		},
	}

In the response from Triton I received an error when calling the sync inference endpoint: rpc error: code = InvalidArgument desc = [request id: 98bb094c-7895-4448-926c-4c31d53b9be4] input 'max_tokens' batch size does not match other inputs for 'ensemble'
So what was wrong? I checked that each model has max_batch_size=64 and I have no idea what to change in the config or in the request…

Hi @prots.igor - For your “max_tokens” input there seems to be a typo. Can you check after replacing “INT32” with “BYTES”, as in the other inputs?

I used checkpoints from this repo: TheFloat16/Llama3-70b-Instruct-TRTLLM at main.
I have three Tesla V100-PCIE-32GB GPUs. I built the engines from those checkpoints on TRT-LLM 0.8.0 in the Docker container, just as described in the tutorial.
Building the engines went fine, but launching the server ends with an error.

[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
[TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
[TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
[TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
[TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
I0705 18:26:12.335485 693 python_be.cc:2404] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Engine version 0.8.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Parameter paged_state cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'paged_state' not found
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Parameter paged_state cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'paged_state' not found
E0705 18:26:12.335788 694 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'lora_config' not found
E0705 18:26:12.335837 694 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'lora_config' not found
I0705 18:26:12.335849 694 model_lifecycle.cc:773] failed to load 'tensorrt_llm'
I0705 18:26:12.335951 694 server.cc:607]

Hi @shacotustra - can you please uninstall v0.8.0 and follow the same steps, but replace 0.8.0 with 0.9.0? And use Triton Inference Server v24.4.
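A minimal sketch of that upgrade, assuming a pip-based TensorRT-LLM install (wheels come from NVIDIA’s PyPI index) and that “v24.4” refers to the 24.04 NGC Triton container with the TRT-LLM backend:

# Replace TensorRT-LLM 0.8.0 with 0.9.0 (the image tag below is an assumption for the "v24.4" Triton release)
pip3 uninstall -y tensorrt_llm
pip3 install tensorrt_llm==0.9.0 --extra-index-url https://pypi.nvidia.com
docker pull nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3

The engines would then need to be rebuilt with the 0.9.0 trtllm-build before relaunching the server, since TensorRT-LLM engines are tied to the version they were built with.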

When I try to build, an error occurs:

[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/77/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/77/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/77/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/77/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/77/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/78/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/78/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/78/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[TensorRT-LLM][WARNING] Note that alibi or sliding window attention are not supported for FMHA on Volta
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/78/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/78/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/78/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/78/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/78/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/layers/79/input_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/79/input_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/79/input_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[TensorRT-LLM][WARNING] Note that alibi or sliding window attention are not supported for FMHA on Volta
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/79/ELEMENTWISE_SUM_0_output_0 and LLaMAForCausalLM/transformer/layers/79/post_layernorm/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/79/post_layernorm/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/layers/79/post_layernorm/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/layers/79/ELEMENTWISE_SUM_1_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_0_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT] [W] IElementWiseLayer with inputs LLaMAForCausalLM/transformer/ln_f/REDUCE_AVG_0_output_0 and LLaMAForCausalLM/transformer/ln_f/SHUFFLE_1_output_0: first input has type Half but second input has type Float.
[07/05/2024-18:48:04] [TRT-LLM] [I] Build TensorRT engine Unnamed Network 0
[07/05/2024-18:48:04] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[07/05/2024-18:48:04] [TRT] [W] Unused Input: position_ids
[07/05/2024-18:48:04] [TRT] [W] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
[07/05/2024-18:48:04] [TRT] [E] 4: [standardEngineBuilder.cpp::initCalibrationParams::1837] Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)
[07/05/2024-18:48:04] [TRT-LLM] [E] Engine building failed, please check the error log.
[07/05/2024-18:48:04] [TRT] [I] Serialized 59 bytes of code generator cache.
[07/05/2024-18:48:04] [TRT] [I] Serialized 0 timing cache entries
[07/05/2024-18:48:04] [TRT-LLM] [I] Timing cache serialized to model.cache
[07/05/2024-18:48:04] [TRT-LLM] [I] Serializing engine to /TensorRT-LLM/TensorRT-LLM/Llama3-70B-engines/rank0.engine...
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 440, in main
    parallel_build(source, build_config, args.output_dir, workers,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 332, in parallel_build
    passed = build_and_save(rank, rank % workers, ckpt_dir,
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/commands/build.py", line 298, in build_and_save
    engine.save(output_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/builder.py", line 566, in save
    serialize_engine(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 105, in serialize_engine
    f.write(engine)
TypeError: a bytes-like object is required, not 'NoneType'
root@207d17ed1e43:/TensorRT-LLM/TensorRT-LLM#