• Hardware Platform (Jetson / GPU) Jetson Orin Nano 8GB / NVIDIA L4 GPU
• DeepStream Version 6.3
• JetPack Version (valid for Jetson only) 5.1.2
• TensorRT Version 8.5.2 / 8.5.3
• NVIDIA GPU Driver Version (valid for GPU only) 525.125.06
• Issue Type (questions, new requirements, bugs) bugs
• How to reproduce the issue?
I tried to build the DINO-FAN_base INT8 model from Retail Object Detection, but it is not faster than the FP16 model.
I built it in the Docker image (nvcr.io/nvidia/tao/tao-toolkit:5.1.0-deploy) on an NVIDIA L4 GPU with the settings below, following DINO with TAO Deploy:
$DEFAULT_SPEC:
gen_trt_engine:
  onnx_file: /path/to/onnx_file
  trt_engine: /path/to/trt_engine
  input_channel: 3
  input_width: 960
  input_height: 544
  tensorrt:
    data_type: int8
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1
    calibration:
      cal_image_dir:
        - /path/to/cal/images
      cal_cache_file: /path/to/cal.bin
      cal_batch_size: 10
      cal_batches: 100
results_dir: /path/to/results
Build command:
dino gen_trt_engine -e $DEFAULT_SPEC
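To see why INT8 does not help, one check (a sketch I have not run yet; layer_info.json and dino_int8_check.engine are placeholder names I made up, and the paths mirror the spec above) is to rebuild the same ONNX directly with trtexec and export per-layer information, which records the precision each layer ends up with. As far as I understand, layers without INT8 kernel implementations silently fall back to FP16/FP32, which is common for the attention/LayerNorm-heavy parts of DINO.
# Sketch: rebuild with trtexec and export per-layer info (precision chosen per layer); paths and output names are placeholders
trtexec --onnx=/path/to/onnx_file --int8 --fp16 --calib=/path/to/cal.bin --profilingVerbosity=detailed --exportLayerInfo=layer_info.json --saveEngine=dino_int8_check.engine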
trtexec benchmark of the FP16 engine:
[11/03/2023-06:01:23] [I] === Performance summary ===
[11/03/2023-06:01:23] [I] Throughput: 19.5048 qps
[11/03/2023-06:01:23] [I] Latency: min = 46.9167 ms, max = 52.901 ms, mean = 51.2365 ms, median = 51.2278 ms, percentile(90%) = 51.7795 ms, percentile(95%) = 52.1802 ms, percentile(99%) = 52.901 ms
[11/03/2023-06:01:23] [I] Enqueue Time: min = 46.8978 ms, max = 52.8832 ms, mean = 51.2152 ms, median = 51.2097 ms, percentile(90%) = 51.752 ms, percentile(95%) = 52.1658 ms, percentile(99%) = 52.8832 ms
[11/03/2023-06:01:23] [I] H2D Latency: min = 0.514343 ms, max = 0.52417 ms, mean = 0.515134 ms, median = 0.514893 ms, percentile(90%) = 0.515381 ms, percentile(95%) = 0.515625 ms, percentile(99%) = 0.52417 ms
[11/03/2023-06:01:23] [I] GPU Compute Time: min = 46.3893 ms, max = 52.3755 ms, mean = 50.7101 ms, median = 50.7035 ms, percentile(90%) = 51.2461 ms, percentile(95%) = 51.6558 ms, percentile(99%) = 52.3755 ms
[11/03/2023-06:01:23] [I] D2H Latency: min = 0.0090332 ms, max = 0.0297852 ms, mean = 0.0113305 ms, median = 0.0100098 ms, percentile(90%) = 0.0131226 ms, percentile(95%) = 0.0284424 ms, percentile(99%) = 0.0297852 ms
[11/03/2023-06:01:23] [I] Total Host Walltime: 2.40966 s
[11/03/2023-06:01:23] [I] Total GPU Compute Time: 2.38337 s
[11/03/2023-06:01:23] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[11/03/2023-06:01:23] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[11/03/2023-06:01:23] [W] * GPU compute time is unstable, with coefficient of variance = 1.4829%.
[11/03/2023-06:01:23] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[11/03/2023-06:01:23] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/03/2023-06:01:23] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8503] # trtexec --loadEngine=dino_binary_fp16.engine --shapes=inputs:1x3x544x960
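The two warnings above say the run may be enqueue-bound rather than GPU-bound. Re-running with the options the log itself suggests (I have not done this yet) would show whether the host side is hiding any INT8 gain:
# Sketch: same benchmark with CUDA graphs and spin-wait, as the log suggests
trtexec --loadEngine=dino_binary_fp16.engine --shapes=inputs:1x3x544x960 --useCudaGraph --useSpinWait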
trtexec benchmark of the INT8 engine:
[11/03/2023-06:02:46] [I] === Performance summary ===
[11/03/2023-06:02:46] [I] Throughput: 18.9483 qps
[11/03/2023-06:02:46] [I] Latency: min = 49.1602 ms, max = 54.325 ms, mean = 53.2089 ms, median = 53.1924 ms, percentile(90%) = 53.7622 ms, percentile(95%) = 54.1433 ms, percentile(99%) = 54.325 ms
[11/03/2023-06:02:46] [I] Enqueue Time: min = 46.1907 ms, max = 54.0582 ms, mean = 52.4956 ms, median = 52.5861 ms, percentile(90%) = 53.0886 ms, percentile(95%) = 53.3435 ms, percentile(99%) = 54.0582 ms
[11/03/2023-06:02:46] [I] H2D Latency: min = 0.564697 ms, max = 0.581055 ms, mean = 0.575512 ms, median = 0.575378 ms, percentile(90%) = 0.579834 ms, percentile(95%) = 0.5802 ms, percentile(99%) = 0.581055 ms
[11/03/2023-06:02:46] [I] GPU Compute Time: min = 48.5796 ms, max = 53.7395 ms, mean = 52.624 ms, median = 52.6085 ms, percentile(90%) = 53.1793 ms, percentile(95%) = 53.5624 ms, percentile(99%) = 53.7395 ms
[11/03/2023-06:02:46] [I] D2H Latency: min = 0.0065918 ms, max = 0.0115967 ms, mean = 0.00936757 ms, median = 0.00927734 ms, percentile(90%) = 0.0107422 ms, percentile(95%) = 0.0108643 ms, percentile(99%) = 0.0115967 ms
[11/03/2023-06:02:46] [I] Total Host Walltime: 2.42766 s
[11/03/2023-06:02:46] [I] Total GPU Compute Time: 2.4207 s
[11/03/2023-06:02:46] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[11/03/2023-06:02:46] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[11/03/2023-06:02:46] [W] * GPU compute time is unstable, with coefficient of variance = 1.35389%.
[11/03/2023-06:02:46] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[11/03/2023-06:02:46] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/03/2023-06:02:46] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8503] # trtexec --loadEngine=dino_binary_int8.engine --shapes=inputs:1x3x544x960
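Since the INT8 engine is not faster on the L4 either, a per-layer timing comparison of the two engines would show where the time actually goes (a sketch; I have not attached its output, and --separateProfileRun keeps the profiling overhead out of the end-to-end numbers):
# Sketch: per-layer timing for both engines
trtexec --loadEngine=dino_binary_fp16.engine --shapes=inputs:1x3x544x960 --dumpProfile --separateProfileRun
trtexec --loadEngine=dino_binary_int8.engine --shapes=inputs:1x3x544x960 --dumpProfile --separateProfileRun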
I also tried to build the INT8 model on a Jetson Orin Nano 8GB with the settings below, but it is still not faster than the FP16 model.
Docker image: nvcr.io/nvidia/deepstream:6.3-triton-multiarch
INT8 calibration cache file:
cal_bin.txt (342.4 KB)
Build command:
trtexec --onnx=retail_object_detection_dino_binary.onnx --saveEngine=dino_int8_best.engine --best --calib=cal.bin
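One thing worth ruling out on the Orin Nano is clock throttling: locking the clocks to maximum before benchmarking makes the numbers comparable (this is a sketch, not something reflected in the logs below, and the power-mode index is board-specific):
# Sketch: max the power mode and lock clocks before benchmarking
# (mode indices differ per board; check with: sudo nvpmodel -q)
sudo nvpmodel -m 0
sudo jetson_clocks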
trtexec benchmark of the FP16 engine:
[11/03/2023-05:48:41] [I] === Performance summary ===
[11/03/2023-05:48:41] [I] Throughput: 2.27153 qps
[11/03/2023-05:48:41] [I] Latency: min = 430.488 ms, max = 451.942 ms, mean = 440.405 ms, median = 436.842 ms, percentile(90%) = 451.395 ms, percentile(95%) = 451.942 ms, percentile(99%) = 451.942 ms
[11/03/2023-05:48:41] [I] Enqueue Time: min = 430.434 ms, max = 451.745 ms, mean = 440.095 ms, median = 436.486 ms, percentile(90%) = 451.135 ms, percentile(95%) = 451.745 ms, percentile(99%) = 451.745 ms
[11/03/2023-05:48:41] [I] H2D Latency: min = 0.370117 ms, max = 0.408691 ms, mean = 0.385107 ms, median = 0.380371 ms, percentile(90%) = 0.399414 ms, percentile(95%) = 0.408691 ms, percentile(99%) = 0.408691 ms
[11/03/2023-05:48:41] [I] GPU Compute Time: min = 430.08 ms, max = 451.546 ms, mean = 440.007 ms, median = 436.456 ms, percentile(90%) = 450.984 ms, percentile(95%) = 451.546 ms, percentile(99%) = 451.546 ms
[11/03/2023-05:48:41] [I] D2H Latency: min = 0.00683594 ms, max = 0.0356445 ms, mean = 0.0130371 ms, median = 0.0109863 ms, percentile(90%) = 0.0112305 ms, percentile(95%) = 0.0356445 ms, percentile(99%) = 0.0356445 ms
[11/03/2023-05:48:41] [I] Total Host Walltime: 4.40232 s
[11/03/2023-05:48:41] [I] Total GPU Compute Time: 4.40007 s
[11/03/2023-05:48:41] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/03/2023-05:48:41] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # trtexec --loadEngine=dino_binary_fp16.engine --shapes=inputs:1x3x544x960
trtexec benchmark of the INT8 engine:
[11/03/2023-05:50:58] [I] === Performance summary ===
[11/03/2023-05:50:58] [I] Throughput: 2.2751 qps
[11/03/2023-05:50:58] [I] Latency: min = 427.809 ms, max = 449.693 ms, mean = 439.688 ms, median = 439.499 ms, percentile(90%) = 446.35 ms, percentile(95%) = 449.693 ms, percentile(99%) = 449.693 ms
[11/03/2023-05:50:58] [I] Enqueue Time: min = 427.512 ms, max = 449.401 ms, mean = 439.386 ms, median = 438.994 ms, percentile(90%) = 446.306 ms, percentile(95%) = 449.401 ms, percentile(99%) = 449.401 ms
[11/03/2023-05:50:58] [I] H2D Latency: min = 0.371094 ms, max = 0.405762 ms, mean = 0.383691 ms, median = 0.375488 ms, percentile(90%) = 0.398193 ms, percentile(95%) = 0.405762 ms, percentile(99%) = 0.405762 ms
[11/03/2023-05:50:58] [I] GPU Compute Time: min = 427.427 ms, max = 449.286 ms, mean = 439.285 ms, median = 439.089 ms, percentile(90%) = 445.9 ms, percentile(95%) = 449.286 ms, percentile(99%) = 449.286 ms
[11/03/2023-05:50:58] [I] D2H Latency: min = 0.00634766 ms, max = 0.0527344 ms, mean = 0.0193604 ms, median = 0.0100098 ms, percentile(90%) = 0.0390625 ms, percentile(95%) = 0.0527344 ms, percentile(99%) = 0.0527344 ms
[11/03/2023-05:50:58] [I] Total Host Walltime: 4.3954 s
[11/03/2023-05:50:58] [I] Total GPU Compute Time: 4.39285 s
[11/03/2023-05:50:58] [I] Explanations of the performance metrics are printed in the verbose logs.
[11/03/2023-05:50:58] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # trtexec --loadEngine=dino_int8_best.engine --shapes=inputs:1x3x544x960
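Finally, as a sanity check on the Orin Nano numbers (roughly 440 ms per inference versus about 51 ms on the L4), watching GPU load, clocks, and temperature during the benchmark would confirm whether the board is compute-bound or throttled; this is a suggestion, not something captured above:
# Sketch: monitor the Jetson while trtexec runs (interval in milliseconds)
sudo tegrastats --interval 1000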