NVPROF and DLA

Hi,
I have a network that takes ~64ms when targeted at the DLA, as opposed to ~10ms on the GPU.
There seems to be a shuffle layer, which is supposed to run on the GPU, that takes 26ms out of the 64.
I see the 26ms when using the setProfiler method.
However, I don’t see it when running the application under NVPROF.
Is that on purpose? Can I not see DLA-related activity via NVPROF?

thanks
Eyal
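For context, the 26ms figure above comes from a per-layer profiler attached with setProfiler. A minimal sketch of how such a profiler accumulates times is below; note the interface is a hand-written stand-in mirroring the shape of TensorRT's nvinfer1::IProfiler (reportLayerTime called once per layer per enqueue), not the real header:

```cpp
#include <cassert>
#include <map>
#include <string>

// Stand-in for nvinfer1::IProfiler: when a profiler is attached with
// IExecutionContext::setProfiler(&profiler), TensorRT calls
// reportLayerTime(name, ms) once per layer per enqueue.
struct IProfilerLike {
    virtual void reportLayerTime(const char* layerName, float ms) = 0;
    virtual ~IProfilerLike() = default;
};

// Accumulates per-layer time across runs so hotspots such as a
// 26 ms "Shuffle input reformatter" stand out.
class LayerProfiler : public IProfilerLike {
public:
    void reportLayerTime(const char* layerName, float ms) override {
        totalMs_[layerName] += ms;
    }
    float totalMs(const std::string& layer) const {
        auto it = totalMs_.find(layer);
        return it == totalMs_.end() ? 0.0f : it->second;
    }
private:
    std::map<std::string, float> totalMs_;
};
```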

I am also baffled by why this layer takes 26ms.
I’ve added some prints after the buildCudaEngine call and during inference using the profiler, to shed some light on this issue.
Any suggestions as to why this layer takes so much time? It doesn’t exist when the engine is built for the GPU.

Layer [57]: [(Unnamed Layer* 69) [Shuffle] input reformatter 0]: 26.9848ms
Default DLA is enabled but layer (Unnamed Layer* 69) [Shuffle] is not running on DLA, falling back to GPU.
Adding reformat layer: (Unnamed Layer* 69) [Shuffle] reformatted input 0 (3x3_s1/Conv2D_raw_output___5:0) from Half(1,64,4096:16,8192) to Float(1,64,4096,131072)
For layer (Unnamed Layer* 69) [Shuffle] a higher-precision implementation was chosen than was requested because it resulted in faster network performance
68: [(Unnamed Layer* 68) [Convolution]], type: kCONVOLUTION, precision: kFLOAT, inputs: 1, outputs: 1
Convolution: Dims: 3 x 3, getNbOutputMaps: 32, Stride: 1x1, Padding: 1x1, Dilation: 1x1
Input tensor: R/Relu_16:0, kFLOAT, Dims: 3[512, 64, 64]
Output tensor: 3x3_s1/Conv2D_raw_output___5:0, kFLOAT, Dims: 3[32, 64, 64]
69: [(Unnamed Layer* 69) [Shuffle]], type: kSHUFFLE, precision: kFLOAT, inputs: 1, outputs: 1
Shuffle:
First transpose: [1, 2, 0, 3, 4, 5, 6, 1, ]
Second transpose: [0, 1, 2, 3, 4, 5, 6, 7, ]
Reshape:
Input tensor: 3x3_s1/Conv2D_raw_output___5:0, kFLOAT, Dims: 3[32, 64, 64]
Output tensor: 3x3_s1/Conv2D:0, kFLOAT, Dims: 3[64, 64, 32]
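As a rough sanity check on that 26.98ms reformat: the data volume implied by the log's formats is tiny. A back-of-envelope sketch (assuming the formats in the log, CHW16-vectorized FP16 in, linear FP32 out, for the 32x64x64 tensor) gives an effective bandwidth of only ~0.03 GB/s, which suggests the time is dominated by the DLA-to-GPU transition rather than by raw memory traffic:

```cpp
#include <cassert>

// Traffic estimate for the reformat in the log:
//   Half(1,64,4096:16,8192) -> CHW16-vectorized FP16 (channels padded to 16)
//   Float(1,64,4096,131072) -> linear NCHW FP32
// Tensor dims from the log: 32 x 64 x 64.
struct ReformatEstimate {
    long long inBytes, outBytes;
};

ReformatEstimate estimateReformat(int c, int h, int w) {
    int cPadded = ((c + 15) / 16) * 16;          // CHW16 pads channels to a multiple of 16
    long long inBytes  = 2LL * cPadded * h * w;  // 2 bytes per half
    long long outBytes = 4LL * c * h * w;        // 4 bytes per float
    return {inBytes, outBytes};
}

// Effective throughput if the given time were pure copy cost.
double effectiveGBps(ReformatEstimate e, double ms) {
    return (e.inBytes + e.outBytes) / (ms * 1e6); // bytes / (ms * 1e6) ~ GB/s
}
```

For the logged layer, estimateReformat(32, 64, 64) is about 0.75 MB total, so 26ms of copy time would mean roughly 0.03 GB/s, orders of magnitude below the memory bandwidth of any Jetson-class device.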

Hi,

To give a further suggestion, would you mind sharing the model with us?

nvprof doesn’t support DLA profiling yet.
You can check the DLA status via this command:

$ cat /sys/devices/platform/host1x/15880000.nvdla0/power/runtime_status

E.g.

$ cat /sys/devices/platform/host1x/15880000.nvdla1/power/runtime_status
active
$ cat /sys/devices/platform/host1x/15880000.nvdla1/power/runtime_status
suspended
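The sysfs check above can also be done from code; a small helper along those lines (the path is the one from the commands above and may differ on other boards, so it is passed in as a parameter):

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Reads a power/runtime_status node, e.g.
// /sys/devices/platform/host1x/15880000.nvdla0/power/runtime_status,
// and reports whether the engine is currently "active".
bool isDlaActive(const std::string& statusPath) {
    std::ifstream f(statusPath);
    std::string status;
    if (!(f >> status)) return false;  // node missing or unreadable: treat as inactive
    return status == "active";
}
```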

Thanks.

GeneralUtils_h.txt (3.0 KB)

Thanks for the prompt answer. I’ve managed to reproduce it in a side project.

Attached are the .cu, the .cuh, and the .sh to compile it.
I suspect that reducing the sizes of m_conv_nb_output_maps and m_conv_2_input_dims.d[0] reduces the time.
However, those are the sizes we use in the application, and even for smaller ones there is a 5x-6x factor in favor of the GPU over the DLA, making the DLA unusable for us.

Please compile (./compile_nvcc.sh) and run it like this:

GPU:
./a.out 0 0 10
[RVLayerProfiler - …]: Layer [1]: [(Unnamed Layer* 0) [Convolution] + (Unnamed Layer* 2) [Activation] input reformatter 0]: 0.29728ms
[RVLayerProfiler - …]: Layer [2]: [(Unnamed Layer* 0) [Convolution] + (Unnamed Layer* 2) [Activation]]: 2.89587ms
[RVLayerProfiler - …]: Layer [3]: [(Unnamed Layer* 3) [Convolution]]: 0.290144ms
[RVLayerProfiler - …]: Layer [4]: [(Unnamed Layer* 4) [Shuffle] input reformatter 0]: 0.01056ms
[RVLayerProfiler - …]: Layer [5]: [(Unnamed Layer* 4) [Shuffle]]: 0.019456ms
Total host: [40.9705 ms]
Average time: [4.09705 ms]
[RVLayerProfiler - …]: Total time for 50 layers: 37.8532ms
[RVLayerProfiler - …]: Shortest layer: [(Unnamed Layer* 4) [Shuffle] input reformatter 0] ran for [0.01008ms]
[RVLayerProfiler - …]: Longest layer: [(Unnamed Layer* 0) [Convolution] + (Unnamed Layer* 2) [Activation]] ran for [4.4231ms]

DLA:
[RVLayerProfiler - …]: Layer [1]: [input to nvm]: 1.0384ms
[RVLayerProfiler - …]: Layer [2]: [{(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Convolution]}]: 0.00368ms
[RVLayerProfiler - …]: Layer [3]: [input copy finish]: 0.102336ms
[RVLayerProfiler - …]: Layer [4]: [{(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Convolution]} output reformatter 0]: 25.6063ms
[RVLayerProfiler - …]: Layer [5]: [{(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Convolution]} output to be reformatted 0 finish]: 0.004896ms
[RVLayerProfiler - …]: Layer [6]: [(Unnamed Layer* 4) [Shuffle]]: 0.021696ms
Total host: [275.006 ms]
Average time: [27.5006 ms]
[RVLayerProfiler - …]: Total time for 60 layers: 268.53ms
[RVLayerProfiler - …]: Shortest layer: [input copy finish] ran for [0.000544ms]
[RVLayerProfiler - …]: Longest layer: [{(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Convolution]} output reformatter 0] ran for [26.1124ms]
[RVLayerProfiler - …]: Average time : [4.40213ms]

compile_nvcc_sh.txt (564 Bytes) Test_cu.txt (18.0 KB) Test_cuh.txt (2.0 KB)

Hi,

Thanks for the sample.

We are going to reproduce this issue in our environment.
We will update here if we make any progress.

Thanks.

Hi,

The sample requires a custom file called GeneralUtils.h.
Could you also share the file with us?

Thanks.

Hi,
I’ve updated the previous post with this file.

Many thanks!

Eyal