Profiling results for a model running on DLA do not match between TensorRT and nsys

I ran ResNet-50 inference on DLA using TensorRT with the command below:

trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --model=/usr/src/tensorrt/data/resnet50/ResNet50_fp32.caffemodel --output=fc1000  --useDLACore=0 --int8 --workspace=1024 --memPoolSize=dlaSRAM:1

The profiling result is:

[03/01/2023-14:22:22] [I] === Performance summary ===
[03/01/2023-14:22:22] [I] Throughput: 655.174 qps
[03/01/2023-14:22:22] [I] Latency: min = 1.71802 ms, max = 2.26781 ms, mean = 1.74723 ms, median = 1.72763 ms, percentile(99%) = 2.23355 ms
[03/01/2023-14:22:22] [I] Enqueue Time: min = 1.22845 ms, max = 2.19363 ms, mean = 1.45464 ms, median = 1.4364 ms, percentile(99%) = 1.94211 ms
[03/01/2023-14:22:22] [I] H2D Latency: min = 0.20874 ms, max = 0.227844 ms, mean = 0.210525 ms, median = 0.210083 ms, percentile(99%) = 0.217041 ms
[03/01/2023-14:22:22] [I] GPU Compute Time: min = 1.49133 ms, max = 2.03729 ms, mean = 1.51961 ms, median = 1.5 ms, percentile(99%) = 2.00461 ms
[03/01/2023-14:22:22] [I] D2H Latency: min = 0.0158081 ms, max = 0.0202637 ms, mean = 0.0170972 ms, median = 0.0168457 ms, percentile(99%) = 0.0187988 ms
[03/01/2023-14:22:22] [I] Total Host Walltime: 3.00378 s

The total latency is 1.747ms.
Then I profiled using nsys with the command below:

nsys profile -t cuda,nvtx,nvmedia,osrt,cudla --accelerator-trace=nvmedia trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --model=/usr/src/tensorrt/data/resnet50/ResNet50_fp32.caffemodel --output=fc1000  --useDLACore=0 --int8 --workspace=1024 --memPoolSize=dlaSRAM:1

Opening the profiled file with NVIDIA Nsight Systems, a screenshot of one frame among the profiled iterations is shown below.

A complete inference consists of copying data to the DLA, running inference on the DLA, and copying data back from the DLA. The selection in the screenshot above covers one such complete inference. Its total latency is 2.375 ms, much longer than the result reported by TensorRT. Even the task on the DLA alone takes 1.75 ms, which is longer than the 1.747 ms latency of the whole inference process.
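For an additional command-line summary of the traced APIs, the same report can also be fed to nsys stats (a sketch only; the report file name depends on the run, report1.nsys-rep is just a typical default name generated by nsys):

nsys stats report1.nsys-rep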

Can somebody take a look at this case and give me some insight on what happened here?

P.S. You can easily reproduce this using the model under the TensorRT installation folder.

Hi,

You are using an Orin Nano, is that correct?
Thanks.

Yes, thank you for moving the topic to the right category. And do you have any suggestions for my problem?

Hi,

The latency only counts inference time.
So it should be 1.71 ms in the console and 1.75 ms in Nsight Systems.

This is acceptable.
Since the profiler collects several profiling traces, it is expected that the app will run slower.

Thanks.

Hi,
Thanks for your reply.
The latency profiled by TensorRT is the sum of the H2D Latency, GPU Compute Time, and D2H Latency detailed in the picture below.

So I think 1.71 ms is not just inference time.
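As a quick check against the first performance summary above, using the mean values:

0.210525 ms (H2D) + 1.51961 ms (GPU Compute) + 0.0170972 ms (D2H) = 1.74723 ms

which matches the reported mean Latency of 1.74723 ms.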
Another piece of evidence: running trtexec with the --dumpProfile parameter prints per-layer profile results, shown in the picture below (it is not the same model). You can tell that the total latency includes data movement.

Hi,

The data transfer usually indicates copying an image from the CPU to GPU.

The transfer time you mentioned (data to nvm) is an internal memcpy.
It is measured as one of the layers within the inference.

Thanks.

I apologize for not making myself clear in my previous message. I am unsure if you were able to view the pictures that I posted in my last reply, as they do not seem to be loading on my side.

It should be noted that the 1.71 ms latency seen on the console includes both the task running on the DLA and the data transfer between SRAM and the GPU. On the other hand, the 1.75 ms latency reported by Nsight Systems only accounts for the task running on the DLA.
However, the overall latency, which includes both the DLA task and the data copying, is shorter than the latency of the DLA task alone as measured by nsys. This is the source of my confusion.

To support the claim that the latency in the console includes data copying, I added --dumpProfile to the trtexec command line; the results show that the total latency includes the time of the "data to nvm" and "fc1000 from nvm" transfer layers.
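For reference, the profile below came from a command of roughly this shape (a sketch only; it is a different model from ResNet-50, so the file paths and output name here are placeholders):

trtexec --deploy=<model.prototxt> --model=<model.caffemodel> --output=<output_blob> --useDLACore=0 --int8 --dumpProfile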

[03/08/2023-10:09:30] [I] === Performance summary ===
[03/08/2023-10:09:30] [I] Throughput: 59.2112 qps
[03/08/2023-10:09:30] [I] Latency: min = 16.854 ms, max = 17.0804 ms, mean = 16.8707 ms, median = 16.8657 ms, percentile(99%) = 16.9868 ms
[03/08/2023-10:09:30] [I] Enqueue Time: min = 16.8416 ms, max = 17.0437 ms, mean = 16.8557 ms, median = 16.8513 ms, percentile(99%) = 16.9577 ms
[03/08/2023-10:09:30] [I] H2D Latency: min = 0.0223389 ms, max = 0.0327759 ms, mean = 0.0233877 ms, median = 0.0231934 ms, percentile(99%) = 0.0307617 ms
[03/08/2023-10:09:30] [I] GPU Compute Time: min = 16.8237 ms, max = 17.0344 ms, mean = 16.8394 ms, median = 16.8345 ms, percentile(99%) = 16.9484 ms
[03/08/2023-10:09:30] [I] D2H Latency: min = 0.00622559 ms, max = 0.0306396 ms, mean = 0.00788608 ms, median = 0.00720215 ms, percentile(99%) = 0.0218506 ms
[03/08/2023-10:09:30] [I] Total Host Walltime: 3.03997 s
[03/08/2023-10:09:30] [I] Total GPU Compute Time: 3.03109 s
[03/08/2023-10:09:30] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[03/08/2023-10:09:30] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[03/08/2023-10:09:30] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/08/2023-10:09:30] [I] 
[03/08/2023-10:09:30] [I] 
[03/08/2023-10:09:30] [I] === Profile (192 iterations ) ===
[03/08/2023-10:09:30] [I]                          Layer   Time (ms)   Avg. Time (ms)   Time %
[03/08/2023-10:09:30] [I]                    data to nvm        7.07           0.0368      0.2
[03/08/2023-10:09:30] [I]  {ForeignNode[conv1...fc1000]}       15.77           0.0821      0.5
[03/08/2023-10:09:30] [I]                fc1000 from nvm     3205.04          16.6929     99.3
[03/08/2023-10:09:30] [I]               data copy finish        0.35           0.0018      0.0
[03/08/2023-10:09:30] [I]             fc1000 copy finish        0.56           0.0029      0.0
[03/08/2023-10:09:30] [I]                          Total     3228.79          16.8166    100.0
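As a quick sanity check, summing the per-layer averages gives 0.0368 + 0.0821 + 16.6929 + 0.0018 + 0.0029 ≈ 16.8166 ms, which matches the Total row and sits right next to the reported mean GPU Compute Time of 16.8394 ms. So the "data to nvm" and "fc1000 from nvm" copy layers appear to be counted inside the reported latency.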

Hi,

Is this a duplicate of topic 244464?

If yes, let's track the follow-up on topic 244464 directly.

Thanks.

I’m sorry, but that’s not quite the topic we’re discussing. Our focus here is on why the latency of the task on DLA is longer than the total latency of the inference, including data transfer between DLA and GPU. Thank you for your understanding.

Hi,

The 1.71ms is the inference time.

The profiling log from trtexec breaks down the inference time into the sub-items, and these items belong to the “inference” part.
Memory copy indicates H2D and D2H, not the data transfer between GPU and DLA.

Also, DLA can only report active or idle status.
So the time reported by Nsight Systems should be the active period of the DLA.

If you are interested in the sub-item issue, we can discuss it on topic 244464.

Thanks.
