How to generate a TRT engine from TAO on Triton Server (TensorRT version incompatibility)

Recently, I successfully trained a classification_tf2 model with NVIDIA TAO and exported it in .etlt format. However, I couldn't find an easy way to deploy the model on Triton Server the way I can with DeepStream. DeepStream generates the TRT engine file from the .etlt automatically, so I never have to worry about TensorRT versions.

In Triton Server, I haven't found a built-in option to deploy automatically from .etlt to an engine file. I've researched some alternatives:

  1. TAO Deploy: a Docker image that can generate the TRT engine file, but it gives no control over the TensorRT version. For example, the Triton Inference Server container 23.06 uses TensorRT 8.6.1.6, so I would need to install TAO Deploy from a wheel instead; however, nvidia-tao-deploy is not supported on Ubuntu 22.04, which ships Python 3.10. I tried Python 3.9 and 3.8 (see the sketch after this list) but encountered errors. It's been quite challenging, and nothing seems to work.
  2. TAO-Converter: a binary that decrypts a .etlt file from the TAO Toolkit and generates a TensorRT engine. The latest release is built against TensorRT 8.5.2.2, so it's not compatible with the Triton Inference Server container 23.06, which ships TensorRT 8.6.1.6.
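
For reference, here is roughly what the wheel-install attempt from item 1 looked like; the virtual environment name is just a placeholder:

# Hypothetical attempt: install the TAO Deploy wheel under Python 3.8
python3.8 -m venv tao-deploy-env
source tao-deploy-env/bin/activate
pip install nvidia-tao-deploy   # failed on Ubuntu 22.04 in my case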

Is there a simpler way to generate the TensorRT engine file for Triton Server? Note that my training environment uses one specific GPU model, while I run multiple Triton Server instances with different GPUs, and each GPU requires its own compatible TensorRT engine.

I can easily generate a TensorRT engine from ONNX models such as YOLOv7; using trtexec is straightforward, unlike the cumbersome process with .etlt. Am I doing something wrong, or is the integration of TAO with Triton Server just this difficult?
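
For comparison, this is a sketch of the trtexec call I use for ONNX; the file names and the "images" input binding are placeholders for my YOLOv7 export:

# Sketch: build an FP16 engine with a dynamic batch dimension from an ONNX model
trtexec --onnx=yolov7.onnx --saveEngine=yolov7_fp16.engine --fp16 \
        --minShapes=images:1x3x640x640 \
        --optShapes=images:8x3x640x640 \
        --maxShapes=images:16x3x640x640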

You can log in to the Triton Inference Server container 23.06 you mentioned, download tao-converter there, and generate the TensorRT engine inside it. Ignore the minor version mismatch of tao-converter as long as the engine can be generated successfully.
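
For example, something along these lines inside the container; the version tag is a placeholder, so check NGC for the tao-converter release matching your platform:

# Sketch, assuming the NGC CLI is available in the container.
# "vX.Y" is a placeholder for an actual tao-converter version tag on NGC;
# the extracted directory name may differ by release.
ngc registry resource download-version "nvidia/tao/tao-converter:vX.Y"
chmod +x tao-converter_vX.Y/tao-converter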

Hi,
The Triton Inference Server container 23.06 uses Ubuntu 22.04.
Ubuntu 22.04 has moved to libssl3 and no longer provides libssl1.1.
So tao-converter fails with this error:
tao-converter: error while loading shared libraries: libcrypto.so.1.1: cannot open shared object file: No such file or directory
I forced the installation of libssl1.1 by adding the Ubuntu 20.04 (focal) security repository:

echo "deb http://security.ubuntu.com/ubuntu focal-security main" | tee /etc/apt/sources.list.d/focal-security.list
apt-get update
apt-get install libssl1.1

Then I deleted the focal-security source list I had created:

rm /etc/apt/sources.list.d/focal-security.list

All seems to work.
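
For reference, the conversion command looked roughly like this; the model key, file names, and the output node name are placeholders for my classification_tf2 export:

# Sketch: build an INT8 engine from the .etlt file with tao-converter.
# $KEY, the file names, and the "predictions/Softmax" output node are placeholders.
./tao-converter model.etlt \
    -k $KEY \
    -c calibration.bin \
    -o predictions/Softmax \
    -p input:0,1x3x256x256,8x3x256x256,16x3x256x256 \
    -t int8 \
    -e model.engine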
Below is the tao-converter output:

[INFO] [MemUsageChange] Init CUDA: CPU +518, GPU +0, now: CPU 523, GPU 253 (MiB)
[INFO] [MemUsageChange] Init builder kernel library: CPU +883, GPU +172, now: CPU 1483, GPU 425 (MiB)
[WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[INFO] ----------------------------------------------------------------
[INFO] Input filename:   /tmp/fileb5aqZz
[INFO] ONNX IR version:  0.0.7
[INFO] Opset version:    13
[INFO] Producer name:    tf2onnx
[INFO] Producer version: 1.12.0 ddca3a
[INFO] Domain:
[INFO] Model version:    0
[INFO] Doc string:
[INFO] ----------------------------------------------------------------
[WARNING] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[INFO] Detected input dimensions from the model: (-1, 3, 256, 256)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 3, 256, 256) for input: input:0
[INFO] Using optimization profile opt shape: (8, 3, 256, 256) for input: input:0
[INFO] Using optimization profile max shape: (16, 3, 256, 256) for input: input:0
[INFO] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[INFO] Graph optimization time: 0.0314244 seconds.
[INFO] Reading Calibration Cache for calibrator: EntropyCalibration2
[INFO] Generated calibration scales using calibration cache. Make sure that calibration cache has latest scales.
[INFO] To regenerate calibration cache, please delete the existing one. TensorRT will generate a new calibration cache.
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block1a_se_squeeze/Mean_Squeeze__402:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block2a_se_squeeze/Mean_Squeeze__398:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block2b_se_squeeze/Mean_Squeeze__380:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block3a_se_squeeze/Mean_Squeeze__384:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block3b_se_squeeze/Mean_Squeeze__388:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block4a_se_squeeze/Mean_Squeeze__408:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block4b_se_squeeze/Mean_Squeeze__406:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block4c_se_squeeze/Mean_Squeeze__404:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block5a_se_squeeze/Mean_Squeeze__390:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block5b_se_squeeze/Mean_Squeeze__396:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block5c_se_squeeze/Mean_Squeeze__386:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block6a_se_squeeze/Mean_Squeeze__400:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block6b_se_squeeze/Mean_Squeeze__392:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block6c_se_squeeze/Mean_Squeeze__378:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block6d_se_squeeze/Mean_Squeeze__394:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor StatefulPartitionedCall/efficientnet-b0/block7a_se_squeeze/Mean_Squeeze__382:0, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[WARNING] Missing scale and zero-point for tensor (Unnamed Layer* 412) [Softmax]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[INFO] Graph optimization time: 0.141812 seconds.
[INFO] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[INFO] Local timing cache in use. Profiling results in this builder pass will not be stored.
[INFO] Detected 1 inputs and 1 output network tensors.
[INFO] Total Host Persistent Memory: 474592
[INFO] Total Device Persistent Memory: 343552
[INFO] Total Scratch Memory: 36352
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 2 MiB, GPU 65 MiB
[INFO] [BlockAssignment] Started assigning block shifts. This will take 129 steps to complete.
[INFO] [BlockAssignment] Algorithm ShiftNTopDown took 5.78666ms to assign 5 blocks to 129 nodes requiring 16777728 bytes.
[INFO] Total Activation Memory: 16777728
[INFO] (Sparsity) Layers eligible for sparse math:
[INFO] (Sparsity) TRT inference plan picked sparse implementation for layers:
[WARNING] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[WARNING] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[WARNING] Check verbose logs for the list of affected weights.
[WARNING] - 65 weights are affected by this issue: Detected subnormal FP16 values.
[WARNING] - 19 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[WARNING] - 2 weights are affected by this issue: Detected finite FP32 values which would overflow in FP16 and converted them to the closest finite FP16 value.
[INFO] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +1, GPU +4, now: CPU 1, GPU 4 (MiB)
