What is NVM, and why does it take so long to get data from it?

Below is the profile result of ResNet-50 running on DLA, profiled by TensorRT. With --int8, the total latency is about 1.7 ms, but with --fp16 it takes 17 ms. The only difference is that with --fp16 enabled there is a step called fc1000 from nvm, and fc1000 is the last node of the model. So what is nvm, and why does it take so long to copy data from it?

[03/01/2023-11:45:20] [I]                          Layer   Time (ms)   Avg. Time (ms)   Time %
[03/01/2023-11:45:20] [I]                    data to nvm       15.52           0.0862      0.5
[03/01/2023-11:45:20] [I]  {ForeignNode[conv1...fc1000]}       54.09           0.3005      1.7
[03/01/2023-11:45:20] [I]                fc1000 from nvm     3116.62          17.3145     97.8
[03/01/2023-11:45:20] [I]               data copy finish        0.49           0.0027      0.0
[03/01/2023-11:45:20] [I]             fc1000 copy finish        0.80           0.0044      0.0
[03/01/2023-11:45:20] [I]                          Total     3187.51          17.7084    100.0
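For reference, a profile like the one above comes from a trtexec invocation along these lines (this matches the full command shown later in the thread; swapping --fp16 for --int8 gives the ~1.7 ms case mentioned above):

$ /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --output=fc1000 --useDLACore=0 --fp16 --allowGPUFallback --dumpProfile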

Hi,

Are you using Jetson Nano, since this topic was posted on the Nano board?
Nano is not equipped with a DLA; only the CPU/GPU is available for inference.

Thanks.

No, I’m using Jetson Orin, and I’m sure the model is run on the DLA.

Thanks.

Moving your topic to the Orin Nano board.

Hi,

The function indicates a memory copy.

NVDLA has its own SRAM memory.
Below is a document for your reference:

http://nvdla.org/primer.html

Thanks.

Thanks for your reply.
Do you have any idea why it takes so long (97.8% of the total latency) to move data from SRAM to CPU? It only takes 0.5% of the total latency to move data from the CPU to SRAM, and the input data size is larger than the output. It does not make sense.

Hi,

TensorRT uses a GPU buffer, so it should be SRAM ↔ GPU.

This depends on your model.
In general, the input data (e.g. an image) is much bigger than the output (e.g. bounding boxes/confidences).

Thanks.

Yes, and it is precisely because the output is so much smaller than the input that it seems strange for the output copy to be much slower than the input copy.
In my model, the input size is [1, 3, 224, 224] and the output size is [1000].
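To put rough numbers on it (assuming 2 bytes per element for fp16 I/O, which is my assumption about the tensor format):

$ echo "input:  $(( 1*3*224*224*2 )) bytes"   # ~294 KB
input:  301056 bytes
$ echo "output: $(( 1000*2 )) bytes"          # ~2 KB
output: 2000 bytes

So the tensor being copied out is roughly 150x smaller than the one copied in, yet its copy is reported as ~97% of the latency.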

Hi,

Could you help us with some experiments?

1. First, have you maximized the device performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2. Could you profile the app with Nsight Systems as well?
We want to confirm if the profiler shows the same behavior.
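For example, something along these lines should work (a minimal sketch; depending on the nsys version, additional trace options may need to be enabled to see the DLA task itself):

$ nsys profile -o resnet_50 /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --output=fc1000 --useDLACore=0 --fp16 --allowGPUFallback --dumpProfile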

Thanks.

Hi,
1. After maximizing the device performance, the profile result is below:

[03/09/2023-15:49:11] [I] === Performance summary ===
[03/09/2023-15:49:11] [I] Throughput: 59.087 qps
[03/09/2023-15:49:11] [I] Latency: min = 16.8738 ms, max = 17.1115 ms, mean = 16.9044 ms, median = 16.8965 ms, percentile(99%) = 17.0645 ms
[03/09/2023-15:49:11] [I] Enqueue Time: min = 16.8612 ms, max = 17.0829 ms, mean = 16.8878 ms, median = 16.8809 ms, percentile(99%) = 17.0225 ms
[03/09/2023-15:49:11] [I] H2D Latency: min = 0.0222168 ms, max = 0.0327148 ms, mean = 0.0232986 ms, median = 0.0229492 ms, percentile(99%) = 0.0322266 ms
[03/09/2023-15:49:11] [I] GPU Compute Time: min = 16.8442 ms, max = 17.0732 ms, mean = 16.8726 ms, median = 16.8655 ms, percentile(99%) = 17.0198 ms
[03/09/2023-15:49:11] [I] D2H Latency: min = 0.00634766 ms, max = 0.024353 ms, mean = 0.00846027 ms, median = 0.00769043 ms, percentile(99%) = 0.0202637 ms
[03/09/2023-15:49:11] [I] Total Host Walltime: 3.02943 s
[03/09/2023-15:49:11] [I] Total GPU Compute Time: 3.0202 s
[03/09/2023-15:49:11] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[03/09/2023-15:49:11] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[03/09/2023-15:49:11] [I] Explanations of the performance metrics are printed in the verbose logs.
[03/09/2023-15:49:11] [I] 
[03/09/2023-15:49:11] [I] 
[03/09/2023-15:49:11] [I] === Profile (191 iterations ) ===
[03/09/2023-15:49:11] [I]                          Layer   Time (ms)   Avg. Time (ms)   Time %
[03/09/2023-15:49:11] [I]                    data to nvm        7.59           0.0398      0.2
[03/09/2023-15:49:11] [I]  {ForeignNode[conv1...fc1000]}       17.36           0.0909      0.5
[03/09/2023-15:49:11] [I]                fc1000 from nvm     3191.52          16.7096     99.2
[03/09/2023-15:49:11] [I]               data copy finish        0.33           0.0017      0.0
[03/09/2023-15:49:11] [I]             fc1000 copy finish        0.55           0.0029      0.0
[03/09/2023-15:49:11] [I]                          Total     3217.37          16.8449    100.0

2. The latency of the DLA task profiled with nsys is approximately 16.8 ms, similar to the trtexec profile result.

I have attached the nsys output for your reference.
Thank you.
resnet_50_3_9.nsys-rep (1.6 MB)
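If it is useful, the summary tables can also be dumped from the report on the command line (nsys stats is what I would expect to use here; the report name matches the attachment above):

$ nsys stats resnet_50_3_9.nsys-rep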

Hi,

We gave it a try with the default ResNet50_N2.prototxt and cannot reproduce this issue.

 /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --useDLACore=0 --int8 --allowGPUFallback --dumpProfile --output=fc1000
...
[03/13/2023-14:45:42] [I] === Profile (2305 iterations ) ===
[03/13/2023-14:45:42] [I]                                                                                                Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %
[03/13/2023-14:45:42] [I]   Reformatting CopyNode for Input Tensor 0 to {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}       89.41           0.0388             0.0387      3.0
[03/13/2023-14:45:42] [I]                                               {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}     2905.94           1.2607             1.2590     96.2
[03/13/2023-14:45:42] [I]  Reformatting CopyNode for Output Tensor 0 to {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}       25.33           0.0110             0.0110      0.8
[03/13/2023-14:45:42] [I]                                                                                                Total     3020.68           1.3105             1.3087    100.0
[03/13/2023-14:45:42] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --useDLACore=0 --int8 --allowGPUFallback --dumpProfile --output=fc1000

The bottleneck is the middle inference part, and the output copy time is much less than the input one.
Which model do you use?

Thanks.

I also used the default ResNet50_N2.prototxt, but with the --fp16 option instead of --int8. If it’s not too much trouble, could you please try again with --fp16?

Hi,

Sure. Will get back to you later.
Thanks.

Hi,

We tested the fp16 mode but cannot reproduce this issue.
More than 99% of execution time is used for inference.

Could you try the same command in your environment as well?

$  /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --useDLACore=0 --fp16 --allowGPUFallback --dumpProfile --output=fc1000
...
[03/15/2023-14:23:01] [I] === Profile (181 iterations ) ===
[03/15/2023-14:23:01] [I]                                                                                                Layer   Time (ms)   Avg. Time (ms)   Median Time (ms)   Time %
[03/15/2023-14:23:01] [I]   Reformatting CopyNode for Input Tensor 0 to {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}       16.78           0.0927             0.0921      0.5
[03/15/2023-14:23:01] [I]                                               {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}     3215.93          17.7675            17.7638     99.4
[03/15/2023-14:23:01] [I]  Reformatting CopyNode for Output Tensor 0 to {ForeignNode[conv1 + bn_conv1 + scale_conv1...fc1000]}        1.75           0.0097             0.0097      0.1
[03/15/2023-14:23:01] [I]                                                                                                Total     3234.46          17.8699            17.8659    100.0
[03/15/2023-14:23:01] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --useDLACore=0 --fp16 --allowGPUFallback --dumpProfile --output=fc1000

Thanks.

Hi,
The result with the command you mentioned is below.

[03/15/2023-14:34:04] [I] === Profile (180 iterations ) ===
[03/15/2023-14:34:04] [I]                          Layer   Time (ms)   Avg. Time (ms)   Time %
[03/15/2023-14:34:04] [I]                    data to nvm        7.88           0.0438      0.2
[03/15/2023-14:34:04] [I]  {ForeignNode[conv1...fc1000]}       16.84           0.0935      0.5
[03/15/2023-14:34:04] [I]                fc1000 from nvm     3194.42          17.7468     99.2
[03/15/2023-14:34:04] [I]               data copy finish        0.32           0.0018      0.0
[03/15/2023-14:34:04] [I]             fc1000 copy finish        0.53           0.0030      0.0
[03/15/2023-14:34:04] [I]                          Total     3219.99          17.8888    100.0
[03/15/2023-14:34:04] [I] 
&&&& PASSED TensorRT.trtexec [TensorRT v8400] # /usr/src/tensorrt/bin/trtexec --deploy=/usr/src/tensorrt/data/resnet50/ResNet50_N2.prototxt --useDLACore=0 --fp16 --allowGPUFallback --dumpProfile --output=fc1000

I’ve noticed that we are currently using different versions of TensorRT: it appears that you are using version 8.5.0.2, while I am using version 8.4.0. I’ve also observed differences in the dumped profiles between these two versions. Specifically, the profile from 8.4.0 includes the latency for ‘fc1000 from nvm’, while 8.5.0.2 does not. Could you please try running the test again with TensorRT 8.4.0? Thank you for your patience and understanding.
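As a side note for anyone comparing results: trtexec prints its TensorRT version in its banner (as in the PASSED lines above), and on Jetson the installed TensorRT packages can also be listed with something like the following (the grep pattern is just a suggestion):

$ dpkg -l | grep -i nvinfer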

Hi,

Is TensorRT 8.5 an option for you?
It is available in JetPack 5.1 right now.

Thanks.

Sorry, I cannot upgrade JetPack right now.
Also, from the profile results posted above, TensorRT 8.5 does not break the inference process down into the data movement between SRAM and GPU, so to reproduce my issue it would be best to use version 8.4.0.

Hi,

It’s recommended to upgrade to TensorRT 8.5 directly.
Thanks.


Check out the DLA GitHub page for the ResNet-50 sample: Instructions to Download, Prepare & Run DLA Reference Models.

We also have a FAQ page that addresses some common questions that we see developers run into: Deep-Learning-Accelerator-SW/FAQ