I ran ResNet-50 inference on DLA using TensorRT with the command below:
trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --model=/usr/src/tensorrt/data/resnet50/ResNet50_fp32.caffemodel --output=fc1000 --useDLACore=0 --int8 --workspace=1024 --memPoolSize=dlaSRAM:1
The profiling result is:
[03/01/2023-14:22:22] [I] === Performance summary ===
[03/01/2023-14:22:22] [I] Throughput: 655.174 qps
[03/01/2023-14:22:22] [I] Latency: min = 1.71802 ms, max = 2.26781 ms, mean = 1.74723 ms, median = 1.72763 ms, percentile(99%) = 2.23355 ms
[03/01/2023-14:22:22] [I] Enqueue Time: min = 1.22845 ms, max = 2.19363 ms, mean = 1.45464 ms, median = 1.4364 ms, percentile(99%) = 1.94211 ms
[03/01/2023-14:22:22] [I] H2D Latency: min = 0.20874 ms, max = 0.227844 ms, mean = 0.210525 ms, median = 0.210083 ms, percentile(99%) = 0.217041 ms
[03/01/2023-14:22:22] [I] GPU Compute Time: min = 1.49133 ms, max = 2.03729 ms, mean = 1.51961 ms, median = 1.5 ms, percentile(99%) = 2.00461 ms
[03/01/2023-14:22:22] [I] D2H Latency: min = 0.0158081 ms, max = 0.0202637 ms, mean = 0.0170972 ms, median = 0.0168457 ms, percentile(99%) = 0.0187988 ms
[03/01/2023-14:22:22] [I] Total Host Walltime: 3.00378 s
The reported mean latency is 1.747 ms.
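As a sanity check on the summary above (plain arithmetic on the logged numbers, nothing TensorRT-specific), the reported mean Latency appears to be exactly the sum of the mean H2D, GPU Compute, and D2H times:

```python
# Mean per-component times from the trtexec summary above (ms)
h2d = 0.210525       # H2D Latency
compute = 1.51961    # GPU Compute Time
d2h = 0.0170972      # D2H Latency

total = h2d + compute + d2h
print(f"{total:.3f} ms")  # 1.747 ms, matching the reported mean Latency
```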
I then profiled with nsys using the command below:
nsys profile -t cuda,nvtx,nvmedia,osrt,cudla --accelerator-trace=nvmedia trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --model=/usr/src/tensorrt/data/resnet50/ResNet50_fp32.caffemodel --output=fc1000 --useDLACore=0 --int8 --workspace=1024 --memPoolSize=dlaSRAM:1
Opening the profiled file in NVIDIA Nsight Systems, the screenshot below shows one frame out of all the profiled iterations.
A complete inference consists of copying data to the DLA, running inference on the DLA, and copying data back from the DLA. The selection in the screenshot above covers one such complete inference. Its total latency is 2.375 ms, much longer than the latency reported by TensorRT. Even the task on the DLA alone takes 1.75 ms, which exceeds the 1.747 ms reported for the entire inference.
Can somebody take a look at this case and give me some insight into what is happening here?
P.S. You can reproduce this simply by using the model under the TensorRT installation folder.