[Hugging Face transformer models + pytorch_quantization] PTQ quantization int8 is slower than fp16

Description

When using pytorch_quantization with Hugging Face models, INT8 is always slower than FP16, whatever the sequence length, batch size, or model. The TensorRT engines are produced with trtexec (see below).

Many QDQ nodes sit just before a Transpose node followed by the MatMul. My impression is that this may be a source of the performance issue (Developer Guide :: NVIDIA Deep Learning TensorRT Documentation).

According to https://github.com/NVIDIA/sampleQAT/blob/master/postprocess_onnx.py:

    """
    This is a workaround to manually transpose the conv weights and remove
    the existing transpose nodes. Currently TRT has a limitation when there is
    a transpose node as an input to the weights of the conv layer. This utility 
    would be removed in future releases.
    """

This may be linked to PTQ quantization int8 is slower than fp16 · Issue #1532 · NVIDIA/TensorRT · GitHub.

Second point: it doesn’t seem that the bert module (https://github.com/NVIDIA/TensorRT/blob/main/tools/pytorch-quantization/pytorch_quantization/nn/modules/quant_bert.py) is enabled in quant_modules (https://github.com/NVIDIA/TensorRT/blob/main/tools/pytorch-quantization/pytorch_quantization/quant_modules.py#L26).
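
For context, the quantization in my notebook relies on the standard quant_modules monkey-patching rather than that bert module. A rough sketch of the setup (the exact descriptors and calibration flow are in the attached notebook; the model name and descriptor choice below are just assumptions):

from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from pytorch_quantization.tensor_quant import QuantDescriptor
from transformers import AutoModelForSequenceClassification

# histogram calibration for the activation quantizers (descriptor choice is an assumption)
quant_nn.QuantLinear.set_default_quant_desc_input(QuantDescriptor(num_bits=8, calib_method="histogram"))

# monkey-patch torch.nn layers (Linear, etc.) with their quantized counterparts
# before instantiating the Hugging Face model; quant_bert is not part of the patched list
quant_modules.initialize()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").cuda()
# calibration then follows the same flow as the calibrate_quant_resnet50 example linked below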


# int8 quantized models
[11/09/2021-11:51:11] [I] === Performance summary ===
[11/09/2021-11:51:11] [I] Throughput: 61.2925 qps
[11/09/2021-11:51:11] [I] Latency: min = 14.9854 ms, max = 26.0117 ms, mean = 16.2563 ms, median = 15.2119 ms, percentile(99%) = 22.6989 ms
[11/09/2021-11:51:11] [I] End-to-End Host Latency: min = 29.5244 ms, max = 44.1949 ms, mean = 32.2826 ms, median = 30.2827 ms, percentile(99%) = 43.3751 ms


# FP16 model - no QDQ nodes
[11/09/2021-11:52:29] [I] === Performance summary ===
[11/09/2021-11:52:29] [I] Throughput: 100.687 qps
[11/09/2021-11:52:29] [I] Latency: min = 9.50928 ms, max = 15.5975 ms, mean = 9.93139 ms, median = 9.64233 ms, percentile(99%) = 13.3743 ms
[11/09/2021-11:52:29] [I] End-to-End Host Latency: min = 18.1421 ms, max = 26.309 ms, mean = 19.6506 ms, median = 19.1113 ms, percentile(99%) = 24.8865 ms

[Screenshot: Netron view of the INT8 quantized model]

Environment

TensorRT Version: 8.2 (preview)
NVIDIA GPU: 3090 RTX
NVIDIA Driver Version: 495.29.05
CUDA Version: 11.5
CUDNN Version: 8.3.0.98
Operating System: Linux Ubuntu 21.04
Python Version (if applicable): 3.9
PyTorch Version (if applicable): 1.10
Baremetal or Container (if so, version): Baremetal

Relevant Files

I get an error when trying to upload the ONNX files (90 MB); the upload fails at around 20%…

However, the notebook below can recreate the artefacts from scratch in under a minute.

Steps To Reproduce

To recreate both the non-quantized model and the quantized artefacts (requires Hugging Face transformers + pytorch_quantization), run the notebook below (the two trtexec commands are at the very end).

Based on https://github.com/NVIDIA/TensorRT/blob/master/tools/pytorch-quantization/examples/calibrate_quant_resnet50.ipynb

quantization (1).ipynb (1.7 MB)
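
For reference, the export step at the very end of the notebook looks roughly like this (a hedged sketch; model, shapes and file names are assumptions, the notebook is the source of truth):

import torch
from pytorch_quantization import nn as quant_nn
from pytorch_quantization import quant_modules
from transformers import AutoModelForSequenceClassification

quant_modules.initialize()
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").cuda().eval()
# (calibration omitted here, see the notebook)

quant_nn.TensorQuantizer.use_fb_fake_quant = True  # export standard QDQ ops instead of custom fake-quant ops

batch, seq_len = 32, 512
inputs = (
    torch.ones((batch, seq_len), dtype=torch.long, device="cuda"),   # input_ids
    torch.ones((batch, seq_len), dtype=torch.long, device="cuda"),   # attention_mask
    torch.zeros((batch, seq_len), dtype=torch.long, device="cuda"),  # token_type_ids
)
torch.onnx.export(
    model,
    inputs,
    "model_qat.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["output"],
    opset_version=13,
)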


Hi, please refer to the links below to perform inference in INT8

Thanks!

Thank you for your answer.
I followed the code of the example (I just “translated” it into Python) and, with this PTQ process, the issue is exactly the same: INT8 quantization has the same latency/throughput as FP32 and is much slower than FP16.

Model (Google Drive public link):
https://drive.google.com/file/d/14wiCeBPTGtWRFdr8Z7-AVtlpCciHojxw/view?usp=sharing

Calibration table (Python-generated):
calibration_cache.bin (17.2 KB)

Logs:

[11/19/2021-09:15:40] [TRT] [I] [MemUsageChange] Init CUDA: CPU +434, GPU +0, now: CPU 5951, GPU 5042 (MiB)
[11/19/2021-09:15:40] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 5951, GPU 5042 (MiB)
<class 'pycuda._driver.DeviceAllocation'>
[11/19/2021-09:15:41] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[11/19/2021-09:15:42] [TRT] [W] Output type must be INT32 for shape outputs
[11/19/2021-09:15:42] [TRT] [W] Output type must be INT32 for shape outputs
[11/19/2021-09:15:42] [TRT] [W] Output type must be INT32 for shape outputs
[11/19/2021-09:15:42] [TRT] [W] Output type must be INT32 for shape outputs
[11/19/2021-09:15:42] [TRT] [I] [MemUsageSnapshot] Builder begin: CPU 6266 MiB, GPU 5122 MiB
[11/19/2021-09:15:42] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6268, GPU 5130 (MiB)
[11/19/2021-09:15:42] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[11/19/2021-09:15:42] [TRT] [W] Calibration Profile is not defined. Running calibration with Profile 0
[11/19/2021-09:15:42] [TRT] [W] Calibration Profile is not defined. Running calibration with Profile 0
[11/19/2021-09:15:42] [TRT] [W] Calibration Profile is not defined. Running calibration with Profile 0
[11/19/2021-09:15:42] [TRT] [W] No implementation of layer bert.embeddings.position_ids obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[11/19/2021-09:15:42] [TRT] [W] No implementation of layer 828 obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[11/19/2021-09:15:42] [TRT] [W] No implementation of layer Unsqueeze_0 obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[11/19/2021-09:15:42] [TRT] [W] No implementation of layer Unsqueeze_1 obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[11/19/2021-09:15:42] [TRT] [W] No implementation of layer Slice_14 obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[11/19/2021-09:15:43] [TRT] [I] [BlockAssignment] Algorithm Linear took 0.506506ms to assign 490 blocks to 490 nodes requiring 34163905024 bytes.
[11/19/2021-09:15:43] [TRT] [I] Total Activation Memory: -195833344
[11/19/2021-09:15:43] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[11/19/2021-09:15:43] [TRT] [I] Total Host Persistent Memory: 10944
[11/19/2021-09:15:43] [TRT] [I] Total Device Persistent Memory: 0
[11/19/2021-09:15:43] [TRT] [I] Total Scratch Memory: 4194304
[11/19/2021-09:15:43] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[11/19/2021-09:15:47] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 4564.68ms to assign 153 blocks to 538 nodes requiring 1022379520 bytes.
[11/19/2021-09:15:47] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6270, GPU 5226 (MiB)
[11/19/2021-09:15:47] [TRT] [I] [MemUsageSnapshot] ExecutionContext creation begin: CPU 6270 MiB, GPU 5210 MiB
[11/19/2021-09:15:47] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6270, GPU 5218 (MiB)
[11/19/2021-09:15:47] [TRT] [I] [MemUsageSnapshot] ExecutionContext creation end: CPU 6270 MiB, GPU 6194 MiB
[11/19/2021-09:15:47] [TRT] [I] Starting Calibration.
[11/19/2021-09:15:48] [TRT] [I]   Calibrated batch 0 in 0.631816 seconds.
[11/19/2021-09:15:49] [TRT] [I]   Calibrated batch 1 in 0.606898 seconds.
[11/19/2021-09:15:49] [TRT] [I]   Calibrated batch 2 in 0.60381 seconds.
[11/19/2021-09:15:50] [TRT] [I]   Calibrated batch 3 in 0.605031 seconds.
[11/19/2021-09:15:50] [TRT] [I]   Calibrated batch 4 in 0.60427 seconds.
[11/19/2021-09:15:51] [TRT] [I]   Calibrated batch 5 in 0.607089 seconds.
[11/19/2021-09:15:52] [TRT] [I]   Calibrated batch 6 in 0.643803 seconds.
[11/19/2021-09:15:52] [TRT] [I]   Calibrated batch 7 in 0.613316 seconds.
[11/19/2021-09:15:53] [TRT] [I]   Calibrated batch 8 in 0.609152 seconds.
[11/19/2021-09:15:53] [TRT] [I]   Calibrated batch 9 in 0.607951 seconds.
[11/19/2021-09:15:54] [TRT] [I]   Calibrated batch 10 in 0.607577 seconds.
[11/19/2021-09:15:55] [TRT] [I]   Calibrated batch 11 in 0.606718 seconds.
[11/19/2021-09:15:55] [TRT] [I]   Calibrated batch 12 in 0.608924 seconds.
[11/19/2021-09:15:56] [TRT] [I]   Calibrated batch 13 in 0.613883 seconds.
[11/19/2021-09:15:56] [TRT] [I]   Calibrated batch 14 in 0.607731 seconds.
[11/19/2021-09:15:57] [TRT] [I]   Calibrated batch 15 in 0.609227 seconds.
[11/19/2021-09:15:58] [TRT] [I]   Calibrated batch 16 in 0.606875 seconds.
[11/19/2021-09:15:58] [TRT] [I]   Calibrated batch 17 in 0.607437 seconds.
[11/19/2021-09:15:59] [TRT] [I]   Calibrated batch 18 in 0.610393 seconds.
[11/19/2021-09:15:59] [TRT] [I]   Calibrated batch 19 in 0.609921 seconds.
[11/19/2021-09:15:59] [TRT] [I]   Post Processing Calibration data in 0.00207784 seconds.
[11/19/2021-09:15:59] [TRT] [I] Calibration completed in 17.7706 seconds.
[11/19/2021-09:16:00] [TRT] [I] Writing Calibration Cache for calibrator: TRT-8200-MinMaxCalibration
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 30) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 33) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 67) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 71) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 140) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 181) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 185) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 202) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 206) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 210) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 222) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 226) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 295) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 336) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 340) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 357) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 361) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 365) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 377) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 381) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 450) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 491) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 495) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 512) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 516) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 520) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 532) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 536) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 605) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 646) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 650) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 667) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 671) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 675) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 687) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 691) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 760) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 801) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 805) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 822) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 826) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 830) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 842) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 846) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 915) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 956) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 960) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 977) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 981) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 985) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 997) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 1001) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 1012) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6268, GPU 5130 (MiB)
[11/19/2021-09:16:00] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[11/19/2021-09:16:33] [TRT] [I] [BlockAssignment] Algorithm Linear took 0.000995ms to assign 4 blocks to 4 nodes requiring 10485858305 bytes.
[11/19/2021-09:16:33] [TRT] [I] Total Activation Memory: 1895923713
[11/19/2021-09:16:33] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[11/19/2021-09:16:33] [TRT] [I] Total Host Persistent Memory: 736
[11/19/2021-09:16:33] [TRT] [I] Total Device Persistent Memory: 0
[11/19/2021-09:16:33] [TRT] [I] Total Scratch Memory: 1333854208
[11/19/2021-09:16:33] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 979 MiB
[11/19/2021-09:16:33] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.028239ms to assign 3 blocks to 6 nodes requiring 1333952512 bytes.
[11/19/2021-09:16:58] [TRT] [I] [BlockAssignment] Algorithm Linear took 0.001188ms to assign 4 blocks to 4 nodes requiring 10485858305 bytes.
[11/19/2021-09:16:58] [TRT] [I] Total Activation Memory: 1895923713
[11/19/2021-09:16:58] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[11/19/2021-09:16:58] [TRT] [I] Total Host Persistent Memory: 736
[11/19/2021-09:16:58] [TRT] [I] Total Device Persistent Memory: 0
[11/19/2021-09:16:58] [TRT] [I] Total Scratch Memory: 1333854208
[11/19/2021-09:16:58] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 86 MiB, GPU 979 MiB
[11/19/2021-09:16:58] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.018331ms to assign 3 blocks to 6 nodes requiring 1333952512 bytes.
[11/19/2021-09:17:23] [TRT] [I] [BlockAssignment] Algorithm Linear took 0.00108ms to assign 4 blocks to 4 nodes requiring 10485858305 bytes.
[11/19/2021-09:17:23] [TRT] [I] Total Activation Memory: 1895923713
[11/19/2021-09:17:23] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[11/19/2021-09:17:23] [TRT] [I] Total Host Persistent Memory: 736
[11/19/2021-09:17:23] [TRT] [I] Total Device Persistent Memory: 0
[11/19/2021-09:17:23] [TRT] [I] Total Scratch Memory: 1333854208
[11/19/2021-09:17:23] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 86 MiB, GPU 979 MiB
[11/19/2021-09:17:23] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.027757ms to assign 3 blocks to 6 nodes requiring 1333952512 bytes.
[11/19/2021-09:17:23] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6538, GPU 5263 (MiB)
[11/19/2021-09:17:23] [TRT] [I] [MemUsageSnapshot] Builder end: CPU 6527 MiB, GPU 5247 MiB
[11/19/2021-09:17:24] [TRT] [I] Loaded engine size: 346 MiB
[11/19/2021-09:17:24] [TRT] [I] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 6614 MiB, GPU 5117 MiB
[11/19/2021-09:17:24] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6874, GPU 5213 (MiB)
[11/19/2021-09:17:24] [TRT] [I] [MemUsageSnapshot] deserializeCudaEngine end: CPU 6874 MiB, GPU 5205 MiB
[11/19/2021-09:17:25] [TRT] [I] Loaded engine size: 346 MiB
[11/19/2021-09:17:25] [TRT] [I] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 6874 MiB, GPU 5206 MiB
[11/19/2021-09:17:25] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 7134, GPU 5302 (MiB)
[11/19/2021-09:17:25] [TRT] [I] [MemUsageSnapshot] deserializeCudaEngine end: CPU 7134 MiB, GPU 5294 MiB
[11/19/2021-09:17:25] [TRT] [I] [MemUsageSnapshot] ExecutionContext creation begin: CPU 6527 MiB, GPU 5206 MiB
[11/19/2021-09:17:25] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6527, GPU 5214 (MiB)
[11/19/2021-09:17:25] [TRT] [I] [MemUsageSnapshot] ExecutionContext creation end: CPU 6529 MiB, GPU 6488 MiB

The important parts of the code are below:

# assumed imports for this snippet
from typing import List

import numpy as np
import pycuda.autoinit  # noqa: F401 - creates the CUDA context (assumption, may be handled elsewhere)
import pycuda.driver as cuda
import tensorrt as trt
from numpy import ndarray
from pycuda.driver import DeviceAllocation


class Calibrator(trt.IInt8Calibrator):
    def __init__(self):
        trt.IInt8Calibrator.__init__(self)
        self.algorithm = trt.CalibrationAlgoType.MINMAX_CALIBRATION
        self.batch_size = 32
        # fake data
        input_list: List[ndarray] = [np.zeros((32, 512), dtype=np.int32) for _ in range(3)]
        # allocate GPU memory for input tensors
        self.device_inputs: List[DeviceAllocation] = [cuda.mem_alloc(tensor.nbytes) for tensor in input_list]
        for h_input, d_input in zip(input_list, self.device_inputs):
            cuda.memcpy_htod_async(d_input, h_input)  # host to GPU
        self.count = 0

    def get_algorithm(self):
        return trt.CalibrationAlgoType.MINMAX_CALIBRATION

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names, p_str=None):
        self.count += 1
        if self.count > 20:
            return []
        # return pointers to arrays
        return [int(d) for d in self.device_inputs]

    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        with open("calibration_cache.bin", "wb") as f:
            f.write(cache)

    def free(self):
        for dinput in self.device_inputs:
            dinput.free()
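
Side note on the calibrator: read_calibration_cache always returns None here, so calibration runs again on every build. A hedged variant that reuses the cache file written above (same file name) would be:

import os

def read_calibration_cache(self):
    if os.path.exists("calibration_cache.bin"):
        with open("calibration_cache.bin", "rb") as f:
            return f.read()
    return None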

and the engine building function:


# assumed imports for this snippet
from typing import Tuple

import tensorrt as trt
from tensorrt import Builder, IBuilderConfig, ICudaEngine, INetworkDefinition
from tensorrt import IOptimizationProfile, Logger, OnnxParser, Runtime


def build_engine(
    runtime: Runtime,
    onnx_file_path: str,
    logger: Logger,
    min_shape: Tuple[int, int],
    optimal_shape: Tuple[int, int],
    max_shape: Tuple[int, int],
    workspace_size: int,
) -> ICudaEngine:
    with trt.Builder(logger) as builder:  # type: Builder
        with builder.create_network(
            flags=1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        ) as network_definition:  # type: INetworkDefinition
            with trt.OnnxParser(network_definition, logger) as parser:  # type: OnnxParser
                builder.max_batch_size = max_shape[0]  # max batch size
                config: IBuilderConfig = builder.create_builder_config()
                config.min_timing_iterations = 1
                config.avg_timing_iterations = 1
                config.max_workspace_size = workspace_size
                # to enable complete trt inspector debugging, only for TensorRT >= 8.2
                # config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
                # CUBLAS_LT only for TensorRT >= 8
                config.set_tactic_sources(
                    tactic_sources=1 << int(trt.TacticSource.CUBLAS) | 1 << int(trt.TacticSource.CUBLAS_LT)
                )
                # config.set_flag(trt.BuilderFlag.FP16)
                config.set_flag(trt.BuilderFlag.INT8)
                config.set_quantization_flag(trt.QuantizationFlag.CALIBRATE_BEFORE_FUSION)
                config.set_flag(trt.BuilderFlag.DISABLE_TIMING_CACHE)
                config.int8_calibrator = Calibrator()
                # https://github.com/NVIDIA/TensorRT/issues/1196 (sometimes big diff in output when using FP16)
                config.set_flag(trt.BuilderFlag.STRICT_TYPES)
                with open(onnx_file_path, "rb") as f:
                    parser.parse(f.read())
                profile: IOptimizationProfile = builder.create_optimization_profile()
                for num_input in range(network_definition.num_inputs):
                    profile.set_shape(
                        input=network_definition.get_input(num_input).name,
                        min=min_shape,
                        opt=optimal_shape,
                        max=max_shape,
                    )
                config.add_optimization_profile(profile)

                trt_engine = builder.build_serialized_network(network_definition, config)
                engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
                assert engine is not None, "error during engine generation :-("
                return engine
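
For reference, this is roughly how the function above is called (a hedged sketch; the plan file name and the workspace size are assumptions mirroring the trtexec commands below):

import tensorrt as trt

trt_logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(trt_logger)
engine = build_engine(
    runtime=runtime,
    onnx_file_path="./triton_models/model-original.onnx",
    logger=trt_logger,
    min_shape=(32, 512),
    optimal_shape=(32, 512),
    max_shape=(32, 512),
    workspace_size=10_000 * 1024 * 1024,  # ~10 GB, matching trtexec --workspace=10000
)
with open("model_int8.plan", "wb") as f:
    f.write(engine.serialize())  # serialized engine for later deserialization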

For completeness, please find below the logs from trtexec:

  • fp16
/usr/src/tensorrt/bin/trtexec --onnx=./triton_models/model-original.onnx --shapes=input_ids:32x512,attention_mask:32x512,token_type_ids:32x512 --workspace=10000 --fp16 --verbose  --dumpProfile --separateProfileRun &> trtexec_fp16.log

trtexec_fp16.log (381.0 KB)

  • int8 (no calibration table provided)
/usr/src/tensorrt/bin/trtexec --onnx=./triton_models/model-original.onnx --shapes=input_ids:32x512,attention_mask:32x512,token_type_ids:32x512 --workspace=10000 --int8 --verbose  --dumpProfile --separateProfileRun &> trtexec_int8.logs

trtexec_int8.log (675.8 KB)


Hi,

We recommend you post your concern on Issues · NVIDIA/TensorRT · GitHub to get better help.

Thank you.

Thank you. I fixed the issue and turned it into a library: GitHub - ELS-RD/transformer-deploy: Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀