I am also baffled as to why this layer takes 26 ms.
I've added some prints after the buildCudaEngine call and during inference using the profiler, to shed some light on this issue.
Any suggestions as to why this layer takes so much time? It doesn't exist when the engine is built for the GPU only.
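For context, the per-layer timings below were collected with an IProfiler attached to the execution context, roughly like this minimal sketch (assuming the TensorRT 5/6 C++ API that pairs with buildCudaEngine; LayerProfiler is just an illustrative name):

```cpp
#include "NvInfer.h"
#include <cstdio>

// Prints one line per layer; called back by TensorRT after each synchronous execute().
struct LayerProfiler : public nvinfer1::IProfiler
{
    int index = 0;
    void reportLayerTime(const char* layerName, float ms) override
    {
        std::printf("Layer [%d]: [%s]: %.4fms\n", index++, layerName, ms);
    }
};

// Attach before running inference:
//   LayerProfiler profiler;
//   context->setProfiler(&profiler);
//   context->execute(batchSize, bindings);  // profiling requires the synchronous execute path
```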
Layer [57]: [(Unnamed Layer* 69) [Shuffle] input reformatter 0]: 26.9848ms
Default DLA is enabled but layer (Unnamed Layer* 69) [Shuffle] is not running on DLA, falling back to GPU.
Adding reformat layer: (Unnamed Layer* 69) [Shuffle] reformatted input 0 (3x3_s1/Conv2D_raw_output___5:0) from Half(1,64,4096:16,8192) to Float(1,64,4096,131072)
For layer (Unnamed Layer* 69) [Shuffle] a higher-precision implementation was chosen than was requested because it resulted in faster network performance
68: [(Unnamed Layer* 68) [Convolution]], type: kCONVOLUTION, precision: kFLOAT, inputs: 1, outputs: 1
Convolution: Dims: 3 x 3, getNbOutputMaps: 32, Stride: 1x1, Padding: 1x1, Dilation: 1x1
Input tensor: R/Relu_16:0, kFLOAT, Dims: 3[512, 64, 64]
Output tensor: 3x3_s1/Conv2D_raw_output___5:0, kFLOAT, Dims: 3[32, 64, 64]
69: [(Unnamed Layer* 69) [Shuffle]], type: kSHUFFLE, precision: kFLOAT, inputs: 1, outputs: 1
Shuffle:
First transpose: [1, 2, 0, 3, 4, 5, 6, 1, ]
Second transpose: [0, 1, 2, 3, 4, 5, 6, 7, ]
Reshape:
Input tensor: 3x3_s1/Conv2D_raw_output___5:0, kFLOAT, Dims: 3[32, 64, 64]
Output tensor: 3x3_s1/Conv2D:0, kFLOAT, Dims: 3[64, 64, 32]
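The layer listing above (entries 68 and 69) comes from walking the network definition after the build. A minimal sketch of that kind of dump, again assuming the TensorRT 5/6 C++ API; dumpNetwork is just an illustrative helper name:

```cpp
#include "NvInfer.h"
#include <cstdio>

// Walks the network definition and prints layer type, precision, and output tensor dims.
void dumpNetwork(nvinfer1::INetworkDefinition* network)
{
    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network->getLayer(i);
        std::printf("%d: [%s], type: %d, precision: %d, inputs: %d, outputs: %d\n",
                    i, layer->getName(),
                    static_cast<int>(layer->getType()),
                    static_cast<int>(layer->getPrecision()),
                    layer->getNbInputs(), layer->getNbOutputs());

        for (int j = 0; j < layer->getNbOutputs(); ++j)
        {
            nvinfer1::ITensor* t = layer->getOutput(j);
            nvinfer1::Dims d = t->getDimensions();
            std::printf("  Output tensor: %s, dims:", t->getName());
            for (int k = 0; k < d.nbDims; ++k)
                std::printf(" %d", d.d[k]);
            std::printf("\n");
        }
    }
}
```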