Another DLA question

Hi,
I'm trying to benchmark a portion of my network on the DLA engine, and I fail to understand the profiler’s output.
Any idea why the “business logic” of the network is so fast:
Layer [2]: [{(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Pooling],(Unnamed Layer* 4) [Convolution]}]: 0.005312ms

while the “output” takes so much time? Is it just an error in the profiler output?
Layer [4]: [output from nvm]: 13.2621ms

This is the relevant output:

--------------- Layers running on DLA:
(Unnamed Layer* 0) [Convolution], (Unnamed Layer* 1) [Scale], (Unnamed Layer* 2) [Activation], (Unnamed Layer* 3) [Pooling], (Unnamed Layer* 4) [Convolution],
--------------- Layers running on GPU:

--------------- Timing {(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Pooling],(Unnamed Layer* 4) [Convolution]}(31)
Tactic 548835008419 is the only option, timing skipped
0: [(Unnamed Layer* 0) [Convolution]], type: kCONVOLUTION, precision: kHALF, inputs: 1, outputs: 1
Convolution: Dims: 3 x 3, getNbOutputMaps: 64, Stride: 1x1, Padding: 1x1, Dilation: 1x1
Input tensor: input, kFLOAT, Dims: 3[64, 512, 512]
Output tensor: (Unnamed Layer* 0) [Convolution]_output, kFLOAT, Dims: 3[64, 512, 512]
1: [(Unnamed Layer* 1) [Scale]], type: kSCALE, precision: kFLOAT, inputs: 1, outputs: 1
Mode: 0(0, 1, 2)
Shifts: kFLOAT, count: 0
Scales: kFLOAT, count: 1
Powers: kFLOAT, count: 0
Input tensor: (Unnamed Layer* 0) [Convolution]_output, kFLOAT, Dims: 3[64, 512, 512]
Output tensor: (Unnamed Layer* 1) [Scale]_output, kFLOAT, Dims: 3[64, 512, 512]
2: [(Unnamed Layer* 2) [Activation]], type: kACTIVATION, precision: kFLOAT, inputs: 1, outputs: 1
Type: 0 (Relu)
Alpha: 1.123, Beta: 4.213
Input tensor: (Unnamed Layer* 1) [Scale]_output, kFLOAT, Dims: 3[64, 512, 512]
Output tensor: (Unnamed Layer* 2) [Activation]_output, kFLOAT, Dims: 3[64, 512, 512]
3: [(Unnamed Layer* 3) [Pooling]], type: kPOOLING, precision: kFLOAT, inputs: 1, outputs: 1
Pooling:
Type : kMAX
Padding Mode: kEXPLICIT_ROUND_DOWN
Window : 3, 3
Stride : 2, 2
Padding : 1, 1
Pre Padding : 2: 1, 1,
Post Padding: 2, 1, 1,
getBlendFactor: 0
getAverageCountExcludesPadding: 1
Input tensor: (Unnamed Layer* 2) [Activation]_output, kFLOAT, Dims: 3[64, 512, 512]
Output tensor: (Unnamed Layer* 3) [Pooling]_output, kFLOAT, Dims: 3[64, 256, 256]
4: [(Unnamed Layer* 4) [Convolution]], type: kCONVOLUTION, precision: kHALF, inputs: 1, outputs: 1
Convolution: Dims: 3 x 3, getNbOutputMaps: 2, Stride: 2x2, Padding: 1x1, Dilation: 1x1
Input tensor: (Unnamed Layer* 3) [Pooling]_output, kFLOAT, Dims: 3[64, 256, 256]
Output tensor: output, kFLOAT, Dims: 3[2, 128, 128]

[RVLayerProfiler - …]: Layer [1]: [input to nvm]: 6.11952ms
[RVLayerProfiler - …]: Layer [2]: [{(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Pooling],(Unnamed Layer* 4) [Convolution]}]: 0.005952ms
[RVLayerProfiler - …]: Layer [3]: [input copy finish]: 0ms
[RVLayerProfiler - …]: Layer [4]: [output from nvm]: 13.2621ms
[RVLayerProfiler - …]: Layer [5]: [output copy finish]: 0.003808ms
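
For context, per-layer output like the above typically comes from a custom nvinfer1::IProfiler attached to the execution context. The actual RVLayerProfiler is not part of this post, so the sketch below is only illustrative of how such timings are collected (the exact reportLayerTime signature varies slightly between TensorRT versions):

#include <NvInfer.h>
#include <cstdio>

// Minimal per-layer profiler sketch; the class name and output format are illustrative only.
class SimpleLayerProfiler : public nvinfer1::IProfiler
{
public:
    // TensorRT calls this once per profiled layer with the measured time in milliseconds.
    void reportLayerTime(const char* layerName, float ms) noexcept override
    {
        printf("Layer [%d]: [%s]: %fms\n", ++mIndex, layerName, ms);
    }
private:
    int mIndex{0};
};

// Usage, assuming an existing IExecutionContext* context:
//   SimpleLayerProfiler profiler;
//   context->setProfiler(&profiler);
//   context->execute(batchSize, bindings);  // per-layer times are reported through the profiler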

I can send a test code if relevant.

Thanks
Eyal

Hi,

Based on your log, the bottleneck is from memory copy rather than inference.

[input to nvm]: 6.11952ms
...
[output from nvm]: 13.2621ms

Please note that the CPU and GPU share the same physical memory on the Jetson platform,
so you don’t need to actually copy the data; you only need to allow access for the heterogeneous processors.
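
For reference, here is a minimal sketch of the mapped (“zero-copy”) allocation path on Jetson; buffer sizes and names are illustrative and not taken from DLATest_1.cu. The device alias returned by cudaHostGetDevicePointer can be passed as the TensorRT binding pointer instead of cudaMemcpy’ing into a separate device buffer:

#include <cuda_runtime.h>

// Allocate host memory that the GPU can access directly (shared physical memory on Jetson).
void* allocZeroCopy(size_t bytes, void** devicePtr)
{
    void* hostPtr = nullptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);                // must be called before the CUDA context is created
    cudaHostAlloc(&hostPtr, bytes, cudaHostAllocMapped);  // pinned + mapped host allocation
    cudaHostGetDevicePointer(devicePtr, hostPtr, 0);      // GPU-visible alias of the same physical pages
    return hostPtr;
}

// Alternatively, cudaMallocManaged(&ptr, bytes) returns a single pointer usable
// from both CPU and GPU; on Jetson neither path performs a physical copy.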

There are several different kinds of memory that can be used for ‘zero copy’.
Please check this page for more information:

Thanks.

Thanks, I’ll give it a try.
However, the reason I asked is that the profiler output for the DLA doesn’t make much sense to me.

This is what the DLA says for this small test network:

[RVLayerProfiler - …]: Layer [46]: [input to nvm]: 0.211264ms
[RVLayerProfiler - …]: Layer [47]: [{(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Pooling],(Unnamed Layer* 4) [Convolution]}]: 0.347616ms
[RVLayerProfiler - …]: Layer [48]: [input copy finish]: 0.0952ms
[RVLayerProfiler - …]: Layer [49]: [output from nvm]: 7.02112ms
[RVLayerProfiler - …]: Layer [50]: [output copy finish]: 0.004096ms
Total host: [82.2292 ms]
Average time: [8.22293 ms]

And this is for the GPU, obtained by just changing the device type via builder->setDefaultDeviceType (see the sketch after this output):

[RVLayerProfiler - …]: Layer [1]: [(Unnamed Layer* 0) [Convolution] + (Unnamed Layer* 2) [Activation] input reformatter 0]: 0.129888ms
[RVLayerProfiler - …]: Layer [2]: [(Unnamed Layer* 0) [Convolution] + (Unnamed Layer* 2) [Activation]]: 1.23267ms
[RVLayerProfiler - …]: Layer [3]: [(Unnamed Layer* 3) [Pooling]]: 0.41904ms
[RVLayerProfiler - …]: Layer [4]: [(Unnamed Layer* 4) [Convolution]]: 0.196832ms
[RVLayerProfiler - …]: Layer [5]: [(Unnamed Layer* 4) [Convolution] output reformatter 0]: 0.006944ms
Total host: [24.2485 ms]
Average time: [2.42485 ms]
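
For reference, a hedged sketch of how the device type can be switched at build time. On the TensorRT version used here (JetPack with CUDA 10.0) these setters live on IBuilder; on newer releases the equivalents are on IBuilderConfig (setDefaultDeviceType, setDLACore, and BuilderFlag::kGPU_FALLBACK). The function name below is illustrative:

#include <NvInfer.h>

// Select GPU or DLA as the default device for all layers before building the engine.
void configureDevice(nvinfer1::IBuilder* builder, bool useDLA)
{
    if (useDLA)
    {
        builder->setFp16Mode(true);                                  // DLA requires FP16 (or INT8) precision
        builder->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
        builder->setDLACore(0);
        builder->allowGPUFallback(true);                             // let unsupported layers fall back to the GPU
    }
    else
    {
        builder->setDefaultDeviceType(nvinfer1::DeviceType::kGPU);
    }
}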

So the whole network calculation on the DLA takes 0.347ms, whereas the calculation of the same input/network on the GPU takes 1.8485ms (1.23267ms + 0.41904ms + 0.196832ms)?

Is this possible? I thought maybe the profiler, when used for DLA ops, yields wrong results and sends me on a wild goose chase…

Thanks
Eyal

Hi,
Attached is a repro for this. Managed memory did not make a difference as far as I can tell.
./a.out 0 100 -> For GPU: Average time: [2.50781 ms]
./a.out 1 100 -> For DLA: Average time: [8.23565 ms]

Compile with:
/usr/local/cuda-10.0/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -I"…/common" -I"/usr/local/cuda/include" -I"/usr/local/cuda/include" -I/usr/src/tensorrt/samples/common/ -o DLATest_1.o -c DLATest_1.cu
/usr/local/cuda-10.0/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -L"/usr/local/cuda/lib64" -L"/usr/local/cuda/lib64" -lnvinfer -lnvparsers -lnvinfer_plugin -lnvonnxparser -lcudnn -lcublas -lcudart -lrt -ldl -lpthread -lopencv_highgui -lopencv_core -o a.out DLATest_1.o
DLATest_1_cu.txt (10.9 KB)

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

We tried to reproduce this issue in our environment but hit a segmentation fault error.
Is the GeneralUtils.h file used in DLATest the same as the utils.h of CUDA cores vs Tensor Cores?

Thanks.