Thank you for your answer.
I followed the code of the example (I just "translated" it into Python), and with this PTQ process the issue is exactly the same: the int8 engine has the same latency/throughput as FP32 and is much slower than FP16.
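For reference, the latency comparison is done with a simple timing loop like the sketch below (a minimal reconstruction, not my exact benchmark script; `measure_latency`, the zero-filled buffers and the fixed 32x512 shapes are assumptions):

```python
# Minimal latency-measurement sketch (reconstruction, not the exact benchmark code).
import time

import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates the CUDA context
import pycuda.driver as cuda
import tensorrt as trt


def measure_latency(engine: trt.ICudaEngine, input_shape=(32, 512), n_runs: int = 100) -> float:
    context = engine.create_execution_context()
    # fix the dynamic input shapes for this run
    for i in range(engine.num_bindings):
        if engine.binding_is_input(i):
            context.set_binding_shape(i, input_shape)
    assert context.all_binding_shapes_specified
    # allocate one zero-filled device buffer per binding
    buffers = []
    for i in range(engine.num_bindings):
        host = np.zeros(tuple(context.get_binding_shape(i)), dtype=trt.nptype(engine.get_binding_dtype(i)))
        device = cuda.mem_alloc(host.nbytes)
        cuda.memcpy_htod(device, host)
        buffers.append(int(device))
    for _ in range(10):  # warm-up
        context.execute_v2(buffers)
    start = time.perf_counter()
    for _ in range(n_runs):
        context.execute_v2(buffers)
    return (time.perf_counter() - start) / n_runs  # seconds per inference
```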
Model (Google Drive public link):
https://drive.google.com/file/d/14wiCeBPTGtWRFdr8Z7-AVtlpCciHojxw/view?usp=sharing
Calibration table (Python-generated):
calibration_cache.bin (17.2 KB)
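To sanity-check the generated cache, I read it back with something like the snippet below (my assumption about the layout: one header line, then one `tensor_name: hex_scale` entry per line):

```python
# Quick inspection of the calibration cache written by the calibrator below
# (layout assumed: header line + one scale entry per tensor).
with open("calibration_cache.bin", "rb") as f:
    lines = f.read().decode("ascii", errors="replace").splitlines()
print(lines[0])                      # e.g. "TRT-8200-MinMaxCalibration"
print(len(lines) - 1, "tensor scale entries recorded")
```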
Logs:
[11/19/2021-09:15:40] [TRT] [I] [MemUsageChange] Init CUDA: CPU +434, GPU +0, now: CPU 5951, GPU 5042 (MiB)
[11/19/2021-09:15:40] [TRT] [I] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 5951, GPU 5042 (MiB)
<class 'pycuda._driver.DeviceAllocation'>
[11/19/2021-09:15:41] [TRT] [W] onnx2trt_utils.cpp:366: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[11/19/2021-09:15:42] [TRT] [W] Output type must be INT32 for shape outputs
[11/19/2021-09:15:42] [TRT] [W] Output type must be INT32 for shape outputs
[11/19/2021-09:15:42] [TRT] [W] Output type must be INT32 for shape outputs
[11/19/2021-09:15:42] [TRT] [W] Output type must be INT32 for shape outputs
[11/19/2021-09:15:42] [TRT] [I] [MemUsageSnapshot] Builder begin: CPU 6266 MiB, GPU 5122 MiB
[11/19/2021-09:15:42] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6268, GPU 5130 (MiB)
[11/19/2021-09:15:42] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[11/19/2021-09:15:42] [TRT] [W] Calibration Profile is not defined. Running calibration with Profile 0
[11/19/2021-09:15:42] [TRT] [W] Calibration Profile is not defined. Running calibration with Profile 0
[11/19/2021-09:15:42] [TRT] [W] Calibration Profile is not defined. Running calibration with Profile 0
[11/19/2021-09:15:42] [TRT] [W] No implementation of layer bert.embeddings.position_ids obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[11/19/2021-09:15:42] [TRT] [W] No implementation of layer 828 obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[11/19/2021-09:15:42] [TRT] [W] No implementation of layer Unsqueeze_0 obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[11/19/2021-09:15:42] [TRT] [W] No implementation of layer Unsqueeze_1 obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[11/19/2021-09:15:42] [TRT] [W] No implementation of layer Slice_14 obeys the requested constraints in strict mode. No conforming implementation was found i.e. requested layer computation precision and output precision types are ignored, using the fastest implementation.
[11/19/2021-09:15:43] [TRT] [I] [BlockAssignment] Algorithm Linear took 0.506506ms to assign 490 blocks to 490 nodes requiring 34163905024 bytes.
[11/19/2021-09:15:43] [TRT] [I] Total Activation Memory: -195833344
[11/19/2021-09:15:43] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[11/19/2021-09:15:43] [TRT] [I] Total Host Persistent Memory: 10944
[11/19/2021-09:15:43] [TRT] [I] Total Device Persistent Memory: 0
[11/19/2021-09:15:43] [TRT] [I] Total Scratch Memory: 4194304
[11/19/2021-09:15:43] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[11/19/2021-09:15:47] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 4564.68ms to assign 153 blocks to 538 nodes requiring 1022379520 bytes.
[11/19/2021-09:15:47] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6270, GPU 5226 (MiB)
[11/19/2021-09:15:47] [TRT] [I] [MemUsageSnapshot] ExecutionContext creation begin: CPU 6270 MiB, GPU 5210 MiB
[11/19/2021-09:15:47] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6270, GPU 5218 (MiB)
[11/19/2021-09:15:47] [TRT] [I] [MemUsageSnapshot] ExecutionContext creation end: CPU 6270 MiB, GPU 6194 MiB
[11/19/2021-09:15:47] [TRT] [I] Starting Calibration.
[11/19/2021-09:15:48] [TRT] [I] Calibrated batch 0 in 0.631816 seconds.
[11/19/2021-09:15:49] [TRT] [I] Calibrated batch 1 in 0.606898 seconds.
[11/19/2021-09:15:49] [TRT] [I] Calibrated batch 2 in 0.60381 seconds.
[11/19/2021-09:15:50] [TRT] [I] Calibrated batch 3 in 0.605031 seconds.
[11/19/2021-09:15:50] [TRT] [I] Calibrated batch 4 in 0.60427 seconds.
[11/19/2021-09:15:51] [TRT] [I] Calibrated batch 5 in 0.607089 seconds.
[11/19/2021-09:15:52] [TRT] [I] Calibrated batch 6 in 0.643803 seconds.
[11/19/2021-09:15:52] [TRT] [I] Calibrated batch 7 in 0.613316 seconds.
[11/19/2021-09:15:53] [TRT] [I] Calibrated batch 8 in 0.609152 seconds.
[11/19/2021-09:15:53] [TRT] [I] Calibrated batch 9 in 0.607951 seconds.
[11/19/2021-09:15:54] [TRT] [I] Calibrated batch 10 in 0.607577 seconds.
[11/19/2021-09:15:55] [TRT] [I] Calibrated batch 11 in 0.606718 seconds.
[11/19/2021-09:15:55] [TRT] [I] Calibrated batch 12 in 0.608924 seconds.
[11/19/2021-09:15:56] [TRT] [I] Calibrated batch 13 in 0.613883 seconds.
[11/19/2021-09:15:56] [TRT] [I] Calibrated batch 14 in 0.607731 seconds.
[11/19/2021-09:15:57] [TRT] [I] Calibrated batch 15 in 0.609227 seconds.
[11/19/2021-09:15:58] [TRT] [I] Calibrated batch 16 in 0.606875 seconds.
[11/19/2021-09:15:58] [TRT] [I] Calibrated batch 17 in 0.607437 seconds.
[11/19/2021-09:15:59] [TRT] [I] Calibrated batch 18 in 0.610393 seconds.
[11/19/2021-09:15:59] [TRT] [I] Calibrated batch 19 in 0.609921 seconds.
[11/19/2021-09:15:59] [TRT] [I] Post Processing Calibration data in 0.00207784 seconds.
[11/19/2021-09:15:59] [TRT] [I] Calibration completed in 17.7706 seconds.
[11/19/2021-09:16:00] [TRT] [I] Writing Calibration Cache for calibrator: TRT-8200-MinMaxCalibration
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 30) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 33) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 67) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 71) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 140) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 181) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 185) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 202) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 206) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 210) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 222) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 226) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 295) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 336) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 340) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 357) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 361) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 365) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 377) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 381) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 450) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 491) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 495) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 512) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 516) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 520) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 532) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 536) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 605) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 646) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 650) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 667) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 671) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 675) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 687) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 691) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 760) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 801) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 805) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 822) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 826) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 830) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 842) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 846) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 915) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 956) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 960) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 977) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 981) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 985) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 997) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 1001) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [W] Missing scale and zero-point for tensor (Unnamed Layer* 1012) [Constant]_output, expect fall back to non-int8 implementation for any layer consuming or producing given tensor
[11/19/2021-09:16:00] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6268, GPU 5130 (MiB)
[11/19/2021-09:16:00] [TRT] [I] Timing cache disabled. Turning it on will improve builder speed.
[11/19/2021-09:16:33] [TRT] [I] [BlockAssignment] Algorithm Linear took 0.000995ms to assign 4 blocks to 4 nodes requiring 10485858305 bytes.
[11/19/2021-09:16:33] [TRT] [I] Total Activation Memory: 1895923713
[11/19/2021-09:16:33] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[11/19/2021-09:16:33] [TRT] [I] Total Host Persistent Memory: 736
[11/19/2021-09:16:33] [TRT] [I] Total Device Persistent Memory: 0
[11/19/2021-09:16:33] [TRT] [I] Total Scratch Memory: 1333854208
[11/19/2021-09:16:33] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 979 MiB
[11/19/2021-09:16:33] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.028239ms to assign 3 blocks to 6 nodes requiring 1333952512 bytes.
[11/19/2021-09:16:58] [TRT] [I] [BlockAssignment] Algorithm Linear took 0.001188ms to assign 4 blocks to 4 nodes requiring 10485858305 bytes.
[11/19/2021-09:16:58] [TRT] [I] Total Activation Memory: 1895923713
[11/19/2021-09:16:58] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[11/19/2021-09:16:58] [TRT] [I] Total Host Persistent Memory: 736
[11/19/2021-09:16:58] [TRT] [I] Total Device Persistent Memory: 0
[11/19/2021-09:16:58] [TRT] [I] Total Scratch Memory: 1333854208
[11/19/2021-09:16:58] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 86 MiB, GPU 979 MiB
[11/19/2021-09:16:58] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.018331ms to assign 3 blocks to 6 nodes requiring 1333952512 bytes.
[11/19/2021-09:17:23] [TRT] [I] [BlockAssignment] Algorithm Linear took 0.00108ms to assign 4 blocks to 4 nodes requiring 10485858305 bytes.
[11/19/2021-09:17:23] [TRT] [I] Total Activation Memory: 1895923713
[11/19/2021-09:17:23] [TRT] [I] Detected 3 inputs and 1 output network tensors.
[11/19/2021-09:17:23] [TRT] [I] Total Host Persistent Memory: 736
[11/19/2021-09:17:23] [TRT] [I] Total Device Persistent Memory: 0
[11/19/2021-09:17:23] [TRT] [I] Total Scratch Memory: 1333854208
[11/19/2021-09:17:23] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 86 MiB, GPU 979 MiB
[11/19/2021-09:17:23] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 0.027757ms to assign 3 blocks to 6 nodes requiring 1333952512 bytes.
[11/19/2021-09:17:23] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6538, GPU 5263 (MiB)
[11/19/2021-09:17:23] [TRT] [I] [MemUsageSnapshot] Builder end: CPU 6527 MiB, GPU 5247 MiB
[11/19/2021-09:17:24] [TRT] [I] Loaded engine size: 346 MiB
[11/19/2021-09:17:24] [TRT] [I] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 6614 MiB, GPU 5117 MiB
[11/19/2021-09:17:24] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6874, GPU 5213 (MiB)
[11/19/2021-09:17:24] [TRT] [I] [MemUsageSnapshot] deserializeCudaEngine end: CPU 6874 MiB, GPU 5205 MiB
[11/19/2021-09:17:25] [TRT] [I] Loaded engine size: 346 MiB
[11/19/2021-09:17:25] [TRT] [I] [MemUsageSnapshot] deserializeCudaEngine begin: CPU 6874 MiB, GPU 5206 MiB
[11/19/2021-09:17:25] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 7134, GPU 5302 (MiB)
[11/19/2021-09:17:25] [TRT] [I] [MemUsageSnapshot] deserializeCudaEngine end: CPU 7134 MiB, GPU 5294 MiB
[11/19/2021-09:17:25] [TRT] [I] [MemUsageSnapshot] ExecutionContext creation begin: CPU 6527 MiB, GPU 5206 MiB
[11/19/2021-09:17:25] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6527, GPU 5214 (MiB)
[11/19/2021-09:17:25] [TRT] [I] [MemUsageSnapshot] ExecutionContext creation end: CPU 6529 MiB, GPU 6488 MiB
The relevant parts of the code are below. First, the calibrator:
from typing import List

import numpy as np
from numpy import ndarray
import pycuda.autoinit  # noqa: F401 -- creates the CUDA context
import pycuda.driver as cuda
from pycuda.driver import DeviceAllocation
import tensorrt as trt


class Calibrator(trt.IInt8Calibrator):
    def __init__(self):
        trt.IInt8Calibrator.__init__(self)
        self.algorithm = trt.CalibrationAlgoType.MINMAX_CALIBRATION
        self.batch_size = 32
        # fake calibration data: 3 inputs (input_ids, attention_mask, token_type_ids), each (32, 512)
        input_list: List[ndarray] = [np.zeros((32, 512), dtype=np.int32) for _ in range(3)]
        # allocate GPU memory for the input tensors and copy the host arrays over
        self.device_inputs: List[DeviceAllocation] = [cuda.mem_alloc(tensor.nbytes) for tensor in input_list]
        for h_input, d_input in zip(input_list, self.device_inputs):
            cuda.memcpy_htod_async(d_input, h_input)  # host to GPU
        self.count = 0

    def get_algorithm(self):
        return trt.CalibrationAlgoType.MINMAX_CALIBRATION

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names, p_str=None):
        self.count += 1
        if self.count > 20:
            # no more batches: stop calibration
            return []
        # return device pointers, one per input tensor
        return [int(d) for d in self.device_inputs]

    def read_calibration_cache(self):
        return None

    def write_calibration_cache(self, cache):
        with open("calibration_cache.bin", "wb") as f:
            f.write(cache)

    def free(self):
        for dinput in self.device_inputs:
            dinput.free()
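One note: since `read_calibration_cache` returns `None` above, every build re-runs the calibration pass. A small variant (a sketch of what could be done, not what the code above does) would reuse the written cache:

```python
import os

def read_calibration_cache(self):
    # hypothetical variant: return the previously written cache so TensorRT
    # can skip calibration on subsequent builds
    if os.path.exists("calibration_cache.bin"):
        with open("calibration_cache.bin", "rb") as f:
            return f.read()
    return None  # no cache yet -> calibration will run
```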
And here is the engine-building function:
from typing import Tuple

import tensorrt as trt
from tensorrt import (
    Builder,
    IBuilderConfig,
    ICudaEngine,
    INetworkDefinition,
    IOptimizationProfile,
    Logger,
    OnnxParser,
    Runtime,
)


def build_engine(
    runtime: Runtime,
    onnx_file_path: str,
    logger: Logger,
    min_shape: Tuple[int, int],
    optimal_shape: Tuple[int, int],
    max_shape: Tuple[int, int],
    workspace_size: int,
) -> ICudaEngine:
    with trt.Builder(logger) as builder:  # type: Builder
        with builder.create_network(
            flags=1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        ) as network_definition:  # type: INetworkDefinition
            with trt.OnnxParser(network_definition, logger) as parser:  # type: OnnxParser
                builder.max_batch_size = max_shape[0]  # max batch size
                config: IBuilderConfig = builder.create_builder_config()
                config.min_timing_iterations = 1
                config.avg_timing_iterations = 1
                config.max_workspace_size = workspace_size
                # to enable complete trt inspector debugging, only for TensorRT >= 8.2
                # config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
                # CUBLAS_LT only for TensorRT >= 8
                config.set_tactic_sources(
                    tactic_sources=1 << int(trt.TacticSource.CUBLAS) | 1 << int(trt.TacticSource.CUBLAS_LT)
                )
                # config.set_flag(trt.BuilderFlag.FP16)
                config.set_flag(trt.BuilderFlag.INT8)
                config.set_quantization_flag(trt.QuantizationFlag.CALIBRATE_BEFORE_FUSION)
                config.set_flag(trt.BuilderFlag.DISABLE_TIMING_CACHE)
                config.int8_calibrator = Calibrator()
                # https://github.com/NVIDIA/TensorRT/issues/1196 (sometimes big diff in output when using FP16)
                config.set_flag(trt.BuilderFlag.STRICT_TYPES)
                with open(onnx_file_path, "rb") as f:
                    # abort if the ONNX graph cannot be parsed
                    if not parser.parse(f.read()):
                        for i in range(parser.num_errors):
                            print(parser.get_error(i))
                        raise RuntimeError(f"failed to parse {onnx_file_path}")
                profile: IOptimizationProfile = builder.create_optimization_profile()
                for num_input in range(network_definition.num_inputs):
                    profile.set_shape(
                        input=network_definition.get_input(num_input).name,
                        min=min_shape,
                        opt=optimal_shape,
                        max=max_shape,
                    )
                config.add_optimization_profile(profile)
                trt_engine = builder.build_serialized_network(network_definition, config)
                engine: ICudaEngine = runtime.deserialize_cuda_engine(trt_engine)
                assert engine is not None, "error during engine generation :-("
                return engine
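For reference, `build_engine` is called roughly as sketched below (the exact shapes and workspace value are assumptions mirroring the trtexec runs further down):

```python
import tensorrt as trt

trt_logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(trt_logger)
engine = build_engine(
    runtime=runtime,
    onnx_file_path="./triton_models/model-original.onnx",
    logger=trt_logger,
    min_shape=(32, 512),       # assumption: fixed 32x512 profile, as in the trtexec --shapes
    optimal_shape=(32, 512),
    max_shape=(32, 512),
    workspace_size=10000 * 1024 * 1024,  # ~10 GB, matching --workspace=10000
)
# serialize the int8 engine to disk
with open("model_int8.plan", "wb") as f:
    f.write(engine.serialize())
```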
For completeness, please find below the logs from trtexec:
- fp16
/usr/src/tensorrt/bin/trtexec --onnx=./triton_models/model-original.onnx --shapes=input_ids:32x512,attention_mask:32x512,token_type_ids:32x512 --workspace=10000 --fp16 --verbose --dumpProfile --separateProfileRun &> trtexec_fp16.log
trtexec_fp16.log (381.0 KB)
- int8 (no calibration table provided)
/usr/src/tensorrt/bin/trtexec --onnx=./triton_models/model-original.onnx --shapes=input_ids:32x512,attention_mask:32x512,token_type_ids:32x512 --workspace=10000 --int8 --verbose --dumpProfile --separateProfileRun &> trtexec_int8.log
trtexec_int8.log (675.8 KB)