DINO-FAN_base INT8 model is not faster than FP16 model

• Hardware Platform (Jetson / GPU) jetson orin nano 8G / NVIDIA L4 GPU
• DeepStream Version 6.3
• JetPack Version (valid for Jetson only) 5.1.2
• TensorRT Version 8.5.2 / 8.5.3
• NVIDIA GPU Driver Version (valid for GPU only) 525.125.06
• Issue Type( questions, new requirements, bugs) bugs
• How to reproduce the issue ?
I tried to build the DINO-FAN_base INT8 model from Retail Object Detection, but it is not faster than the FP16 model.
I built it in the docker image (nvcr.io/nvidia/tao/tao-toolkit:5.1.0-deploy) on an NVIDIA L4 GPU with the settings below, following DINO with TAO Deploy:

$DEFAULT_SPEC:

gen_trt_engine:
  onnx_file: /path/to/onnx_file
  trt_engine: /path/to/trt_engine
  input_channel: 3
  input_width: 960
  input_height: 544
  tensorrt:
    data_type: int8
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1
    calibration:
      cal_image_dir:
      - /path/to/cal/images
      cal_cache_file: /path/to/cal.bin
      cal_batch_size: 10
      cal_batches: 100
results_dir: /path/to/results

Build command:

dino gen_trt_engine -e $DEFAULT_SPEC

trtexec benchmark on FP16 model

[11/03/2023-06:01:23] [I] === Performance summary ===
[11/03/2023-06:01:23] [I] Throughput: 19.5048 qps
[11/03/2023-06:01:23] [I] Latency: min = 46.9167 ms, max = 52.901 ms, mean = 51.2365 ms, median = 51.2278 ms, percentile(90%) = 51.7795 ms, percentile(95%) = 52.1802 ms, percentile(99%) = 52.901 ms
[11/03/2023-06:01:23] [I] Enqueue Time: min = 46.8978 ms, max = 52.8832 ms, mean = 51.2152 ms, median = 51.2097 ms, percentile(90%) = 51.752 ms, percentile(95%) = 52.1658 ms, percentile(99%) = 52.8832 ms
[11/03/2023-06:01:23] [I] H2D Latency: min = 0.514343 ms, max = 0.52417 ms, mean = 0.515134 ms, median = 0.514893 ms, percentile(90%) = 0.515381 ms, percentile(95%) = 0.515625 ms, percentile(99%) = 0.52417 ms
[11/03/2023-06:01:23] [I] GPU Compute Time: min = 46.3893 ms, max = 52.3755 ms, mean = 50.7101 ms, median = 50.7035 ms, percentile(90%) = 51.2461 ms, percentile(95%) = 51.6558 ms, percentile(99%) = 52.3755 ms
[11/03/2023-06:01:23] [I] D2H Latency: min = 0.0090332 ms, max = 0.0297852 ms, mean = 0.0113305 ms, median = 0.0100098 ms, percentile(90%) = 0.0131226 ms, percentile(95%) = 0.0284424 ms, percentile(99%) = 0.0297852 ms
[11/03/2023-06:01:23] [I] Total Host Walltime: 2.40966 s
[11/03/2023-06:01:23] [I] Total GPU Compute Time: 2.38337 s
[11/03/2023-06:01:23] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[11/03/2023-06:01:23] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[11/03/2023-06:01:23] [W] * GPU compute time is unstable, with coefficient of variance = 1.4829%.
[11/03/2023-06:01:23] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[11/03/2023-06:01:23] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/03/2023-06:01:23] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8503] # trtexec --loadEngine=dino_binary_fp16.engine --shapes=inputs:1x3x544x960

trtexec benchmark on INT8 model

[11/03/2023-06:02:46] [I] === Performance summary ===
[11/03/2023-06:02:46] [I] Throughput: 18.9483 qps
[11/03/2023-06:02:46] [I] Latency: min = 49.1602 ms, max = 54.325 ms, mean = 53.2089 ms, median = 53.1924 ms, percentile(90%) = 53.7622 ms, percentile(95%) = 54.1433 ms, percentile(99%) = 54.325 ms
[11/03/2023-06:02:46] [I] Enqueue Time: min = 46.1907 ms, max = 54.0582 ms, mean = 52.4956 ms, median = 52.5861 ms, percentile(90%) = 53.0886 ms, percentile(95%) = 53.3435 ms, percentile(99%) = 54.0582 ms
[11/03/2023-06:02:46] [I] H2D Latency: min = 0.564697 ms, max = 0.581055 ms, mean = 0.575512 ms, median = 0.575378 ms, percentile(90%) = 0.579834 ms, percentile(95%) = 0.5802 ms, percentile(99%) = 0.581055 ms
[11/03/2023-06:02:46] [I] GPU Compute Time: min = 48.5796 ms, max = 53.7395 ms, mean = 52.624 ms, median = 52.6085 ms, percentile(90%) = 53.1793 ms, percentile(95%) = 53.5624 ms, percentile(99%) = 53.7395 ms
[11/03/2023-06:02:46] [I] D2H Latency: min = 0.0065918 ms, max = 0.0115967 ms, mean = 0.00936757 ms, median = 0.00927734 ms, percentile(90%) = 0.0107422 ms, percentile(95%) = 0.0108643 ms, percentile(99%) = 0.0115967 ms
[11/03/2023-06:02:46] [I] Total Host Walltime: 2.42766 s
[11/03/2023-06:02:46] [I] Total GPU Compute Time: 2.4207 s
[11/03/2023-06:02:46] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[11/03/2023-06:02:46] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[11/03/2023-06:02:46] [W] * GPU compute time is unstable, with coefficient of variance = 1.35389%.
[11/03/2023-06:02:46] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[11/03/2023-06:02:46] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/03/2023-06:02:46] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8503] # trtexec --loadEngine=dino_binary_int8.engine --shapes=inputs:1x3x544x960
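The gap between the two summaries above can be quantified with a small parser over the trtexec output. This is just a sketch; the quoted `Throughput` lines are taken from the logs above:

```python
import re

def parse_qps(log: str) -> float:
    """Extract the 'Throughput: N qps' figure from a trtexec summary."""
    match = re.search(r"Throughput:\s*([\d.]+)\s*qps", log)
    if match is None:
        raise ValueError("no Throughput line found in log")
    return float(match.group(1))

# Throughput lines quoted from the two performance summaries above.
fp16_log = "[11/03/2023-06:01:23] [I] Throughput: 19.5048 qps"
int8_log = "[11/03/2023-06:02:46] [I] Throughput: 18.9483 qps"

speedup = parse_qps(int8_log) / parse_qps(fp16_log)
print(f"INT8 vs FP16 speedup: {speedup:.3f}x")  # below 1.0: INT8 is slower here
```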

I also tried to build the INT8 model on a Jetson Orin Nano 8 GB machine with the settings below, but it is still not faster than the FP16 model.
docker image: nvcr.io/nvidia/deepstream:6.3-triton-multiarch
INT8 calibration cache file:
cal_bin.txt (342.4 KB)

Build command:

trtexec --onnx=retail_object_detection_dino_binary.onnx --saveEngine=dino_int8_best.engine --best --calib=cal.bin

trtexec benchmark on FP16 model

[11/03/2023-05:48:41] [I] === Performance summary ===
[11/03/2023-05:48:41] [I] Throughput: 2.27153 qps
[11/03/2023-05:48:41] [I] Latency: min = 430.488 ms, max = 451.942 ms, mean = 440.405 ms, median = 436.842 ms, percentile(90%) = 451.395 ms, percentile(95%) = 451.942 ms, percentile(99%) = 451.942 ms
[11/03/2023-05:48:41] [I] Enqueue Time: min = 430.434 ms, max = 451.745 ms, mean = 440.095 ms, median = 436.486 ms, percentile(90%) = 451.135 ms, percentile(95%) = 451.745 ms, percentile(99%) = 451.745 ms
[11/03/2023-05:48:41] [I] H2D Latency: min = 0.370117 ms, max = 0.408691 ms, mean = 0.385107 ms, median = 0.380371 ms, percentile(90%) = 0.399414 ms, percentile(95%) = 0.408691 ms, percentile(99%) = 0.408691 ms
[11/03/2023-05:48:41] [I] GPU Compute Time: min = 430.08 ms, max = 451.546 ms, mean = 440.007 ms, median = 436.456 ms, percentile(90%) = 450.984 ms, percentile(95%) = 451.546 ms, percentile(99%) = 451.546 ms
[11/03/2023-05:48:41] [I] D2H Latency: min = 0.00683594 ms, max = 0.0356445 ms, mean = 0.0130371 ms, median = 0.0109863 ms, percentile(90%) = 0.0112305 ms, percentile(95%) = 0.0356445 ms, percentile(99%) = 0.0356445 ms
[11/03/2023-05:48:41] [I] Total Host Walltime: 4.40232 s
[11/03/2023-05:48:41] [I] Total GPU Compute Time: 4.40007 s
[11/03/2023-05:48:41] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/03/2023-05:48:41] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # trtexec --loadEngine=dino_binary_fp16.engine --shapes=inputs:1x3x544x960

trtexec benchmark on INT8 model

[11/03/2023-05:50:58] [I] === Performance summary ===
[11/03/2023-05:50:58] [I] Throughput: 2.2751 qps
[11/03/2023-05:50:58] [I] Latency: min = 427.809 ms, max = 449.693 ms, mean = 439.688 ms, median = 439.499 ms, percentile(90%) = 446.35 ms, percentile(95%) = 449.693 ms, percentile(99%) = 449.693 ms
[11/03/2023-05:50:58] [I] Enqueue Time: min = 427.512 ms, max = 449.401 ms, mean = 439.386 ms, median = 438.994 ms, percentile(90%) = 446.306 ms, percentile(95%) = 449.401 ms, percentile(99%) = 449.401 ms
[11/03/2023-05:50:58] [I] H2D Latency: min = 0.371094 ms, max = 0.405762 ms, mean = 0.383691 ms, median = 0.375488 ms, percentile(90%) = 0.398193 ms, percentile(95%) = 0.405762 ms, percentile(99%) = 0.405762 ms
[11/03/2023-05:50:58] [I] GPU Compute Time: min = 427.427 ms, max = 449.286 ms, mean = 439.285 ms, median = 439.089 ms, percentile(90%) = 445.9 ms, percentile(95%) = 449.286 ms, percentile(99%) = 449.286 ms
[11/03/2023-05:50:58] [I] D2H Latency: min = 0.00634766 ms, max = 0.0527344 ms, mean = 0.0193604 ms, median = 0.0100098 ms, percentile(90%) = 0.0390625 ms, percentile(95%) = 0.0527344 ms, percentile(99%) = 0.0527344 ms
[11/03/2023-05:50:58] [I] Total Host Walltime: 4.3954 s
[11/03/2023-05:50:58] [I] Total GPU Compute Time: 4.39285 s
[11/03/2023-05:50:58] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/03/2023-05:50:58] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # trtexec --loadEngine=dino_int8_best.engine --shapes=inputs:1x3x544x960

Moving to TAO forum for better support.

Could you double-check by running everything inside the TAO Deploy docker?

You can log in to the TAO Deploy docker:
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash

Then run the following command (refer to https://github.com/NVIDIA/tao_tutorials/blob/main/notebooks/tao_launcher_starter_kit/metric_learning_recognition/metric_learning_recognition.ipynb):

ml_recog gen_trt_engine xxx

Then use the trtexec inside this deploy docker and run again; trtexec is included in the docker by default.

The INT8 model is still slower than the FP16 one after following your suggestion.
Below are the settings and logs:
$SPEC_FILE:

results_dir: /path/to/results
gen_trt_engine:
  onnx_file: /path/to/retail_object_detection_dino_binary.onnx
  trt_engine: /path/to/dino_official_binary_int8_b1.engine
  batch_size: -1
  verbose: True
  tensorrt:
    data_type: int8
    workspace_size: 10240
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1
    calibration:
      cal_cache_file: /path/to/cal_b1.bin
      cal_batch_size: 1
      cal_batches: 1000
      cal_image_dir:
      - /path/to/cal/images

Build command:

ml_recog gen_trt_engine -e $SPEC_FILE

Build log:
gen_int8.log (1.7 MB)

INT8 calibration cache file:
cal_b1.bin.txt (342.4 KB)
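For anyone inspecting the attached cache: a TensorRT INT8 calibration cache is plain text, with a header line (e.g. "TRT-8503-EntropyCalibration2") followed by one "tensor name: hex scale" line per tensor, where the hex is the big-endian IEEE-754 bits of the float scale. A sketch of a reader (the two-entry sample below is hypothetical, not the real cal_b1.bin):

```python
import struct

def read_calib_cache(text: str) -> dict[str, float]:
    """Parse the text form of a TensorRT INT8 calibration cache:
    header line first, then '<tensor name>: <float bits in big-endian hex>'."""
    scales = {}
    for line in text.strip().splitlines()[1:]:  # skip the header line
        name, _, hexbits = line.rpartition(": ")
        scales[name] = struct.unpack(">f", bytes.fromhex(hexbits))[0]
    return scales

# Hypothetical two-entry cache for illustration only.
sample = "TRT-8503-EntropyCalibration2\ninputs: 3f800000\nbackbone.conv1: 3c010a14"
for tensor, scale in read_calib_cache(sample).items():
    print(f"{tensor}: scale={scale:.6g}")
```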

trtexec benchmark on INT8 model

[11/06/2023-11:14:29] [I] === Performance summary ===
[11/06/2023-11:14:29] [I] Throughput: 18.6675 qps
[11/06/2023-11:14:29] [I] Latency: min = 49.8384 ms, max = 55.6736 ms, mean = 53.9995 ms, median = 53.7979 ms, percentile(90%) = 55.3287 ms, percentile(95%) = 55.3457 ms, percentile(99%) = 55.6736 ms
[11/06/2023-11:14:29] [I] Enqueue Time: min = 46.5236 ms, max = 55.0415 ms, mean = 53.2848 ms, median = 53.1683 ms, percentile(90%) = 54.7729 ms, percentile(95%) = 54.8506 ms, percentile(99%) = 55.0415 ms
[11/06/2023-11:14:29] [I] H2D Latency: min = 0.550659 ms, max = 0.585205 ms, mean = 0.574882 ms, median = 0.576355 ms, percentile(90%) = 0.580566 ms, percentile(95%) = 0.580811 ms, percentile(99%) = 0.585205 ms
[11/06/2023-11:14:29] [I] GPU Compute Time: min = 49.2626 ms, max = 55.0984 ms, mean = 53.4153 ms, median = 53.2153 ms, percentile(90%) = 54.743 ms, percentile(95%) = 54.7655 ms, percentile(99%) = 55.0984 ms
[11/06/2023-11:14:29] [I] D2H Latency: min = 0.00634766 ms, max = 0.010498 ms, mean = 0.00931317 ms, median = 0.00952148 ms, percentile(90%) = 0.0103149 ms, percentile(95%) = 0.010498 ms, percentile(99%) = 0.010498 ms
[11/06/2023-11:14:29] [I] Total Host Walltime: 2.46418 s
[11/06/2023-11:14:29] [I] Total GPU Compute Time: 2.4571 s
[11/06/2023-11:14:29] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[11/06/2023-11:14:29] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[11/06/2023-11:14:29] [W] * GPU compute time is unstable, with coefficient of variance = 1.81827%.
[11/06/2023-11:14:29] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[11/06/2023-11:14:29] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/06/2023-11:14:29] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8503] # trtexec --loadEngine=dino_official_binary_int8_b1.engine --shapes=inputs:1x3x544x960

Hi,
After checking: we currently do not support INT8 for DINO (the detection model architecture), because accuracy would drop significantly. The reason the INT8 TRT engine is slower is that many layers in the DINO transformer do not support INT8, so they fall back to FP32 instead.
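The FP32 fallback described above can be checked by exporting per-layer information from trtexec (`--profilingVerbosity=detailed --exportLayerInfo=layers.json`) and tallying the reported datatypes. The exact JSON schema varies between TensorRT versions; the field names used below ("Layers", "Outputs", "Format/Datatype") and the three-layer sample are assumptions to adapt, not the real engine dump:

```python
import json
from collections import Counter

def precision_histogram(layer_info_json: str) -> Counter:
    """Count layers by the datatype of their first output, as reported in a
    trtexec --exportLayerInfo dump (field names assumed; adjust to match
    your TensorRT version's schema)."""
    info = json.loads(layer_info_json)
    hist = Counter()
    for layer in info.get("Layers", []):
        outputs = layer.get("Outputs", [])
        dtype = outputs[0].get("Format/Datatype", "unknown") if outputs else "none"
        # Keep only the datatype token (e.g. "FP32") from strings like
        # "Row major linear FP32".
        hist[dtype.split()[-1]] += 1
    return hist

# Hypothetical three-layer dump for illustration only.
sample = json.dumps({"Layers": [
    {"Name": "conv1", "Outputs": [{"Format/Datatype": "Row major linear INT8"}]},
    {"Name": "attn.qkv", "Outputs": [{"Format/Datatype": "Row major linear FP32"}]},
    {"Name": "attn.proj", "Outputs": [{"Format/Datatype": "Row major linear FP32"}]},
]})
print(precision_histogram(sample))  # a large FP32 count would confirm the fallback
```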

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.