What is NVM, and why does it take so long to get data from it?

Below is the profile result of ResNet-50 running on DLA, profiled by TensorRT. With --int8, the total latency is about 1.7 ms, but with --fp16 it takes 17 ms. The only difference is that with --fp16 enabled there is a step called fc1000 from nvm, and fc1000 is the last node of the model. So what is nvm, and why does it take so long to copy data from it?

[03/01/2023-11:45:20] [I]                          Layer   Time (ms)   Avg. Time (ms)   Time %
[03/01/2023-11:45:20] [I]                    data to nvm       15.52           0.0862      0.5
[03/01/2023-11:45:20] [I]  {ForeignNode[conv1...fc1000]}       54.09           0.3005      1.7
[03/01/2023-11:45:20] [I]                fc1000 from nvm     3116.62          17.3145     97.8
[03/01/2023-11:45:20] [I]               data copy finish        0.49           0.0027      0.0
[03/01/2023-11:45:20] [I]             fc1000 copy finish        0.80           0.0044      0.0
[03/01/2023-11:45:20] [I]                          Total     3187.51          17.7084    100.0
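For reference, a profile like the one above comes from a trtexec invocation along these lines (this matches the full command shown later in the thread; swapping --fp16 for --int8 gives the ~1.7 ms case mentioned above):

$ /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --output=fc1000 --useDLACore=0 --fp16 --allowGPUFallback --dumpProfile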

Hi,

Are you using Jetson Nano, since this topic was posted on the Nano board?
Nano is not equipped with a DLA; only the CPU/GPU is available for inference.

Thanks.

No, I’m using Jetson Orin, and I’m sure the model is run on the DLA.

Thanks.

Moving your topic to the Orin Nano board.

Hi,

The function indicates a memory copy.

NVDLA has its own SRAM memory.
Below is a document for your reference:

http://nvdla.org/primer.html

Thanks.

Thanks for your reply.
Do you have any idea why it takes so long (97.8% of the total latency) to move data from SRAM to CPU? It only takes 0.5% of the total latency to move data from the CPU to SRAM, and the input data size is larger than the output. It does not make sense.

Hi,

TensorRT uses a GPU buffer, so it should be SRAM ↔ GPU.

This depends on your model.
In general, the input data (e.g. an image) is much bigger than the output (e.g. bounding boxes/confidences).

Thanks.

Yes, and it is precisely because the output is so much smaller than the input that it seems strange for the output copy to be much slower than the input copy.
In my model, the input size is [1, 3, 224, 224] and the output size is [1000].
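To put rough numbers on it (assuming 2 bytes per element for fp16 I/O, which is my assumption about the tensor format):

$ echo "input:  $(( 1*3*224*224*2 )) bytes"   # ~294 KB
input:  301056 bytes
$ echo "output: $(( 1000*2 )) bytes"          # ~2 KB
output: 2000 bytes

So the tensor being copied out is roughly 150x smaller than the one copied in, yet its copy is reported as ~97% of the latency.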

Hi,

Could you help us with some experiments?

1. First, have you maximized the device performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2. Could you profile the app with Nsight Systems as well?
We want to confirm if the profiler shows the same behavior.
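For example, something along these lines should work (a minimal sketch; depending on the nsys version, additional trace options may need to be enabled to see the DLA task itself):

$ nsys profile -o resnet_50 /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --output=fc1000 --useDLACore=0 --fp16 --allowGPUFallback --dumpProfile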

Thanks.

Hi,
1. After maximizing the device performance, the profile result is below:

[03/09/2023-15:49:11] [I] === Performance summary ===
[03/09/2023-15:49:11] [I] Throughput: 59.087 qps
[03/09/2023-15:49:11] [I] Latency: min = 16.8738 ms, max = 17.1115 ms, mean = 16.9044 ms, median = 16.8965 ms, percentile(99%) = 17.0645 ms
[03/09/2023-15:49:11] [I] Enqueue Time: min = 16.8612 ms, max = 17.0829 ms, mean = 16.8878 ms, median = 16.8809 ms, percentile(99%) = 17.0225 ms
[03/09/2023-15:49:11] [I] H2D Latency: min = 0.0222168 ms, max = 0.0327148 ms, mean = 0.0232986 ms, median = 0.0229492 ms, percentile(99%) = 0.0322266 ms
[03/09/2023-15:49:11] [I] GPU Compute Time: min = 16.8442 ms, max = 17.0732 ms, mean = 16.8726 ms, median = 16.8655 ms, percentile(99%) = 17.0198 ms
[03/09/2023-15:49:11] [I] D2H Latency: min = 0.00634766 ms, max = 0.024353 ms, mean = 0.00846027 ms, median = 0.00769043 ms, percentile(99%) = 0.0202637 ms
[03/09/2023-15:49:11] [I] Total Host Walltime: 3.02943 s
[03/09/2023-15:49:11] [I] Total GPU Compute Time: 3.0202 s
[03/09/2023-15:49:11] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[03/09/2023-15:49:11] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[03/09/2023-15:49:11] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/09/2023-15:49:11] [I] 
[03/09/2023-15:49:11] [I] 
[03/09/2023-15:49:11] [I] === Profile (191 iterations ) ===
[03/09/2023-15:49:11] [I]                          Layer   Time (ms)   Avg. Time (ms)   Time %
[03/09/2023-15:49:11] [I]                    data to nvm        7.59           0.0398      0.2
[03/09/2023-15:49:11] [I]  {ForeignNode[conv1...fc1000]}       17.36           0.0909      0.5
[03/09/2023-15:49:11] [I]                fc1000 from nvm     3191.52          16.7096     99.2
[03/09/2023-15:49:11] [I]               data copy finish        0.33           0.0017      0.0
[03/09/2023-15:49:11] [I]             fc1000 copy finish        0.55           0.0029      0.0
[03/09/2023-15:49:11] [I]                          Total     3217.37          16.8449    100.0

2. The latency of the DLA task profiled with nsys is approximately 16.8 ms, similar to the trtexec profile result.

I have attached the nsys output for your reference.
Thank you.
resnet_50_3_9.nsys-rep (1.6 MB)
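If it is useful, the summary tables can also be dumped from the report on the command line (nsys stats is what I would expect to use here; the report name matches the attachment above):

$ nsys stats resnet_50_3_9.nsys-rep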

Hi,

We gave it a try with the default ResNet50_N2.prototxt and cannot reproduce this issue.

 /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --useDLACore=0 --int8 --allowGPUFallback --dumpProfile --output=fc1000
...
[03/13/2023-14:45:42] [I] === Profile (2305 iterations ) ===
[03/13/2023-14:45:42] [I]                                                                                                Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %
[03/13/2023-14:45:42] [I]   Reformatting CopyNode for Input Tensor 0 to {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}       89.41           0.0388             0.0387      3.0
[03/13/2023-14:45:42] [I]                                               {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}     2905.94           1.2607             1.2590     96.2
[03/13/2023-14:45:42] [I]  Reformatting CopyNode for Output Tensor 0 to {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}       25.33           0.0110             0.0110      0.8
[03/13/2023-14:45:42] [I]                                                                                                Total     3020.68           1.3105             1.3087    100.0
[03/13/2023-14:45:42] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --useDLACore=0 --int8 --allowGPUFallback --dumpProfile --output=fc1000

The bottleneck is the middle inference part, and the output copy time is much less than the input one.
Which model do you use?

Thanks.

I also used the default ResNet50_N2.prototxt, but with the --fp16 option instead of --int8. If it’s not too much trouble, could you please try again with --fp16?

Hi,

Sure. Will get back to you later.
Thanks.

Hi,

We tested the fp16 mode but cannot reproduce this issue.
More than 99% of execution time is used for inference.

Could you try the same command in your environment as well?

$  /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --useDLACore=0 --fp16 --allowGPUFallback --dumpProfile --output=fc1000
...
[03/15/2023-14:23:01] [I] === Profile (181 iterations ) ===
[03/15/2023-14:23:01] [I]                                                                                                Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %
[03/15/2023-14:23:01] [I]   Reformatting CopyNode for Input Tensor 0 to {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}       16.78           0.0927             0.0921      0.5
[03/15/2023-14:23:01] [I]                                               {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}     3215.93          17.7675            17.7638     99.4
[03/15/2023-14:23:01] [I]  Reformatting CopyNode for Output Tensor 0 to {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}        1.75           0.0097             0.0097      0.1
[03/15/2023-14:23:01] [I]                                                                                                Total     3234.46          17.8699            17.8659    100.0
[03/15/2023-14:23:01] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --useDLACore=0 --fp16 --allowGPUFallback --dumpProfile --output=fc1000

Thanks.

Hi,
The result with the command you mentioned is below.

[03/15/2023-14:34:04] [I] === Profile (180 iterations ) ===
[03/15/2023-14:34:04] [I]                          Layer   Time (ms)   Avg. Time (ms)   Time %
[03/15/2023-14:34:04] [I]                    data to nvm        7.88           0.0438      0.2
[03/15/2023-14:34:04] [I]  {ForeignNode[conv1...fc1000]}       16.84           0.0935      0.5
[03/15/2023-14:34:04] [I]                fc1000 from nvm     3194.42          17.7468     99.2
[03/15/2023-14:34:04] [I]               data copy finish        0.32           0.0018      0.0
[03/15/2023-14:34:04] [I]             fc1000 copy finish        0.53           0.0030      0.0
[03/15/2023-14:34:04] [I]                          Total     3219.99          17.8888    100.0
[03/15/2023-14:34:04] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --useDLACore=0 --fp16 --allowGPUFallback --dumpProfile --output=fc1000

I’ve noticed that we are currently using different versions of TensorRT: it appears that you are using version 8.5.0.2, while I am using version 8.4.0. I’ve also observed differences in the dumped profiles between these two versions. Specifically, the profile from 8.4.0 includes the latency for ‘fc1000 from nvm’, while 8.5.0.2 does not. Could you please try running the test again with TensorRT 8.4.0? Thank you for your patience and understanding.
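As a side note for anyone comparing results: trtexec prints its TensorRT version in its banner (as in the PASSED lines above), and on Jetson the installed TensorRT packages can also be listed with something like the following (the grep pattern is just a suggestion):

$ dpkg -l | grep -i nvinfer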

Hi,

Is TensorRT 8.5 an option for you?
It is available in JetPack 5.1 right now.

Thanks.

Sorry, I cannot upgrade JetPack right now.
Also, from the profile results posted above, TensorRT 8.5 does not break the inference process down into the data movement between SRAM and GPU, so to reproduce my issue it would be best to use version 8.4.0.

Hi,

It’s recommended to upgrade to TensorRT 8.5 directly.
Thanks.


Check out the DLA GitHub page for the ResNet-50 sample: Instructions to Download, Prepare & Run DLA Reference Models.

We also have a FAQ page that addresses some common questions that we see developers run into: Deep-Learning-Accelerator-SW/FAQ