TensorRT --- non-INT8 fallback when trying to calibrate an ONNX model

Description

Hello,

I am trying to calibrate an ONNX model to INT8 precision using the TensorRT Python API, with IInt8EntropyCalibrator2. The resulting engine file is about 1/4 the size of the original ONNX model, with ~2% loss in accuracy. However, during calibration I get these warnings:

I then run inference with the INT8 engine file in DeepStream 6.4, but the model is still slow, almost as slow as the original FP32 ONNX model. Is this caused by the warnings above, or is it something in the DeepStream config file?

Thank you!

Environment

TensorRT Version: 8.6.1
GPU Type: RTX 4060
Nvidia Driver Version: 555.22
CUDA Version: 12.4
CUDNN Version:
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.10.12
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 2.2.1
Baremetal or Container (if container which image + tag):

Relevant Files

This is the Python script I use for calibrating to INT8:

import torch
import tensorrt as trt
import pandas as pd

from pathlib import Path
from polygraphy.backend.trt import Calibrator, CreateConfig, EngineFromNetwork, Profile, NetworkFromOnnxPath, TrtRunner, SaveEngine
from torch.utils.data import DataLoader
from torchvision import transforms
from oml.datasets.base import DatasetQueryGallery
from oml.utils.dataframe_format import check_retrieval_dataframe_format

def dataloader(loader, input_name, dtype):
    # Yield calibration batches as {input_name: numpy array}, the feed_dict format polygraphy's Calibrator expects.
    for images in loader:
        yield {input_name: images['input_tensors'].to(dtype=dtype).numpy()}
            
def calibrate(img_size, batch_size, dtype, model_name, input_name):
    profile = Profile()
    profile.add(name=input_name, 
                min=(batch_size, 3, img_size[0], img_size[1]), 
                opt=(batch_size, 3, img_size[0], img_size[1]), 
                max=(batch_size, 3, img_size[0], img_size[1]))
    
    onnx_path = model_name
    df = pd.read_csv("oml_dataset.csv")
    dataset_root = Path("data")

    check_retrieval_dataframe_format(df=df, dataset_root=dataset_root)
    val_df = df[df['split'] == 'validation']

    transform = transforms.Compose([
            transforms.Resize(img_size, interpolation=3),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225], inplace=True)
    ])

    val_dataset = DatasetQueryGallery(val_df, dataset_root=dataset_root, transform=transform)
    loader = DataLoader(val_dataset, batch_size=batch_size)
    
    calibrator = Calibrator(
        data_loader=dataloader(loader, input_name, dtype),
        cache=onnx_path.replace(".onnx", "_calibration.cache").split('/')[-1],
        batch_size=batch_size,
    )
    
    engine = EngineFromNetwork(
        network=NetworkFromOnnxPath(onnx_path),
        config=CreateConfig(
            int8=True,
            calibrator=calibrator,
            profiles=[profile],
            profiling_verbosity=trt.ProfilingVerbosity.DETAILED,
            sparse_weights=False,
        ),
    )
    
    engine_path = onnx_path.replace(".onnx", ".engine").split('/')[-1]
    build_engine = SaveEngine(engine = engine, path=engine_path)
    build_engine()
    
if __name__ == "__main__":
    img_size = (256, 128)
    batch_size = 32
    input_name = 'input'
    dtype = torch.float32
    model_name = 'vit_reid_embed_512_fp32_acc_8509.onnx'
    calibrate(img_size, batch_size, dtype, model_name, input_name)
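
After building, it can be useful to check which layers actually ended up running in INT8, since the warnings suggest some tensors have no calibration scales. Below is a minimal sketch, assuming TensorRT 8.6 and the engine file produced by the script above (the filename is just the one from my run); it relies on the engine having been built with ProfilingVerbosity.DETAILED, which the CreateConfig call already sets:

import tensorrt as trt

# Deserialize the engine saved by the calibration script (path assumed).
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("vit_reid_embed_512_fp32_acc_8509.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# The engine inspector prints one line per layer, including tensor formats,
# so layers that fell back from Int8 to Float/Half can be spotted.
inspector = engine.create_engine_inspector()
for i in range(engine.num_layers):
    print(inspector.get_layer_information(i, trt.LayerInformationFormat.ONELINE))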

This is the DeepStream config file:

[property]
gpu-id=0
model-color-format=0
onnx-file=/app/V2/Pipeline/Models/ReId/vit_reid_embed_512_fp32_acc_8509.onnx
model-engine-file=/app/V2/Pipeline/Models/ReId/vit_reid_embed_512_fp32_acc_8509.onnx_b32_gpu0_int8.engine
int8-calib-file=/app/V2/Pipeline/Models/ReId/vit_reid_embed_512_fp32_acc_8509_calibration.cache
network-mode=1
batch-size=32
interval=0
gie-unique-id=2
process-mode=2
network-type=100
output-tensor-meta=1
infer-dims=3;256;128
tensor-meta-pool-size=256
scaling-filter=2
operate-on-class-ids=1
net-scale-factor=0.017354
offsets=123.675000;116.280000;103.530000
maintain-aspect-ratio=0
symmetric-padding=0
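
For reference, the net-scale-factor and offsets above should correspond to the Normalize transform used during calibration: nvinfer applies y = net-scale-factor * (x - offsets) to 0-255 pixel values, while the calibration transform applies (x/255 - mean) / std per channel. A quick check of that arithmetic (my own, using the ImageNet mean/std from the calibration script):

mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]

offsets = [m * 255 for m in mean]          # roughly [123.675, 116.28, 103.53] -- matches offsets=...
scale = 1.0 / (sum(std) / len(std) * 255)  # ~0.01735 -- close to net-scale-factor=0.017354;
                                           # a single scalar, so the per-channel std is averaged
print(offsets, scale)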

This is a link to the ONNX model: Dropbox

Steps To Reproduce

Please include:

  • Exact steps/commands to build your repro
  • Exact steps/commands to run your repro
  • Full traceback of errors encountered

The missing scales and zero-points for some tensors suggest that the calibration process might be incomplete. This could happen if the calibration data wasn’t representative of the actual data you’ll use for inference.

Please refer to the following, which may help you.

We are moving this post to the DeepStream forum to get better help regarding the above.

Thank you.

Is the GPU loading high when you run the DeepStream pipeline with the INT8 engine?

Hi, sorry for the late reply!
In INT8 mode, running inference on the same video occupies 2465 MB of GPU memory; in FP32, it occupies 2578 MB.

  1. Can you measure the GPU loading with the “nvidia-smi dmon” command when running the deepstream-app case?
  2. Can you run “trtexec --loadEngine=” to test the INT8 engine’s performance?

So this is what I get when I’m running the nvidia-smi dmon command:

gpu    pwr  gtemp  mtemp     sm    mem    enc    dec    jpg    ofa   mclk   pclk
Idx      W      C      C      %      %      %      %      %      %    MHz    MHz
0      1     49      -      0      0      0      0      0      0    405    210
0      1     49      -      0      0      0      0      0      0    405    210
0      5     49      -     37     45      0      7      0      0    405    210
0     49     58      -     46     48      0      0      0      0   8000   1875
0     60     58      -     93     76      0      3      0      0   8000   1890
0     59     60      -     95     68      0      3      0      0   8000   1800
0     59     60      -     97     81      0      1      0      0   8000   1725
0     63     60      -     99     88      0      2      0      0   8000   1755
0     60     61      -     95     75      0      4      0      0   8000   1875
0     62     62      -     95     79      0      4      0      0   8000   1890
0     63     62      -     96     73      0      4      0      0   8000   1920
0     64     63      -     96     74      0      4      0      0   8000   1890
0     60     62      -     97     81      0      3      0      0   8000   1770
0     22     58      -     54     50      0      0      0      0   8000   2490
0     21     56      -      0      0      0      0      0      0   7000   2490
0     15     56      -     53     46      0      0      0      0   8000   1845
0     13     55      -      0      0      0      0      0      0   7000   1845
0     12     55      -      0      0      0      0      0      0   6000    345
0     11     55      -      0      0      0      0      0      0   6000    225

As for the trtexec command, the system can’t find it.

The “trtexec” binary can be found at “/usr/src/tensorrt/bin/trtexec”, or you can build it from the open-source NVIDIA/TensorRT repository on GitHub.
According to the GPU loading, the GPU is overloaded. You need to use “trtexec” to measure the model’s performance on the platform.

Here is the performance summary for the INT8 model:

[07/01/2024-09:01:14] [I] === Build Options ===
[07/01/2024-09:01:14] [I] Max batch: explicit batch
[07/01/2024-09:01:14] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[07/01/2024-09:01:14] [I] minTiming: 1
[07/01/2024-09:01:14] [I] avgTiming: 8
[07/01/2024-09:01:14] [I] Precision: FP32+INT8

[07/01/2024-09:01:17] [I] Engine loaded in 0.365692 sec.
[07/01/2024-09:01:17] [I] [TRT] Loaded engine size: 87 MiB
[07/01/2024-09:01:17] [V] [TRT] Deserialization required 46713 microseconds.
[07/01/2024-09:01:17] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +84, now: CPU 0, GPU 84 (MiB)
[07/01/2024-09:01:17] [I] Engine deserialized in 0.0795864 sec.
[07/01/2024-09:01:17] [V] [TRT] Total per-runner device persistent memory is 1536
[07/01/2024-09:01:17] [V] [TRT] Total per-runner host persistent memory is 461776
[07/01/2024-09:01:17] [V] [TRT] Allocated activation device memory of size 278396928
[07/01/2024-09:01:17] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +266, now: CPU 1, GPU 350 (MiB)
[07/01/2024-09:01:17] [V] [TRT] CUDA lazy loading is enabled.
[07/01/2024-09:01:17] [I] Setting persistentCacheLimit to 0 bytes.
[07/01/2024-09:01:17] [V] Using enqueueV3.
[07/01/2024-09:01:17] [I] Using random values for input input
[07/01/2024-09:01:17] [I] Input binding for input with dimensions 32x3x306x154 is created.
[07/01/2024-09:01:17] [I] Output binding for output with dimensions 32x512 is created.
[07/01/2024-09:01:17] [I] Starting inference
[07/01/2024-09:01:20] [I] Warmup completed 1 queries over 200 ms
[07/01/2024-09:01:20] [I] Timing trace has 30 queries over 3.17 s
[07/01/2024-09:01:20] [I]
[07/01/2024-09:01:20] [I] === Trace details ===
[07/01/2024-09:01:20] [I] Trace averages of 10 runs:
[07/01/2024-09:01:20] [I] Average on 10 runs - GPU latency: 103.317 ms - Host latency: 104.757 ms (enqueue 1.65581 ms)
[07/01/2024-09:01:20] [I] Average on 10 runs - GPU latency: 104.421 ms - Host latency: 105.865 ms (enqueue 1.67887 ms)
[07/01/2024-09:01:20] [I] Average on 10 runs - GPU latency: 104.396 ms - Host latency: 105.837 ms (enqueue 1.80339 ms)
[07/01/2024-09:01:20] [I]
[07/01/2024-09:01:20] [I] === Performance summary ===
[07/01/2024-09:01:20] [I] Throughput: 9.46372 qps
[07/01/2024-09:01:20] [I] Latency: min = 102.919 ms, max = 107.214 ms, mean = 105.486 ms, median = 105.249 ms, percentile(90%) = 106.817 ms, percentile(95%) = 107.079 ms, percentile(99%) = 107.214 ms
[07/01/2024-09:01:20] [I] Enqueue Time: min = 1.4856 ms, max = 2.39795 ms, mean = 1.71269 ms, median = 1.61072 ms, percentile(90%) = 2.03326 ms, percentile(95%) = 2.146 ms, percentile(99%) = 2.39795 ms
[07/01/2024-09:01:20] [I] H2D Latency: min = 1.41699 ms, max = 1.45679 ms, mean = 1.43253 ms, median = 1.43018 ms, percentile(90%) = 1.44392 ms, percentile(95%) = 1.44983 ms, percentile(99%) = 1.45679 ms
[07/01/2024-09:01:20] [I] GPU Compute Time: min = 101.466 ms, max = 105.775 ms, mean = 104.045 ms, median = 103.812 ms, percentile(90%) = 105.38 ms, percentile(95%) = 105.641 ms, percentile(99%) = 105.775 ms
[07/01/2024-09:01:20] [I] D2H Latency: min = 0.00830078 ms, max = 0.0107422 ms, mean = 0.00885213 ms, median = 0.00854492 ms, percentile(90%) = 0.010498 ms, percentile(95%) = 0.0107422 ms, percentile(99%) = 0.0107422 ms
[07/01/2024-09:01:20] [I] Total Host Walltime: 3.17 s
[07/01/2024-09:01:20] [I] Total GPU Compute Time: 3.12134 s
[07/01/2024-09:01:20] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/01/2024-09:01:20] [V]
[07/01/2024-09:01:20] [V] === Explanations of the performance metrics ===
[07/01/2024-09:01:20] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[07/01/2024-09:01:20] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[07/01/2024-09:01:20] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/01/2024-09:01:20] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[07/01/2024-09:01:20] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[07/01/2024-09:01:20] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[07/01/2024-09:01:20] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[07/01/2024-09:01:20] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
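
To put those numbers in perspective, here is the per-image arithmetic from the summary above (my own calculation, using the reported throughput and the batch size of 32):

qps = 9.46372                              # queries (batches) per second reported by trtexec
batch = 32
images_per_second = qps * batch            # ~303 crops per second
ms_per_image = 1000.0 / images_per_second  # ~3.3 ms per crop, even in INT8
print(images_per_second, ms_per_image)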

The model is too heavy for your GPU. Please consult the TAO forum or the TensorRT forum for how to optimize the model.

I only used the TensorRT framework to quantize the model. Is there any other way to do it?

Please consult the TAO forum or the TensorRT forum.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.