Another DLA question

Hi,
I'm trying to benchmark a portion of my network on the DLA engine, and I fail to understand the profiler’s output.
Any idea why the “business logic” of the network is so fast:
Layer [2]: [{(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Pooling],(Unnamed Layer* 4) [Convolution]}]: 0.005312ms

while the “output” takes so much time? Is it just an error in the profiler output?
Layer [4]: [output from nvm]: 13.2621ms

This is the relevant output:

--------------- Layers running on DLA:
(Unnamed Layer* 0) [Convolution], (Unnamed Layer* 1) [Scale], (Unnamed Layer* 2) [Activation], (Unnamed Layer* 3) [Pooling], (Unnamed Layer* 4) [Convolution],
--------------- Layers running on GPU:

--------------- Timing {(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Pooling],(Unnamed Layer* 4) [Convolution]}(31)
Tactic 548835008419 is the only option, timing skipped
0: [(Unnamed Layer* 0) [Convolution]], type: kCONVOLUTION, precision: kHALF, inputs: 1, outputs: 1
Convolution: Dims: 3 x 3, getNbOutputMaps: 64, Stride: 1x1, Padding: 1x1, Dilation: 1x1
Input tensor: input, kFLOAT, Dims: 3[64, 512, 512]
Output tensor: (Unnamed Layer* 0) [Convolution]_output, kFLOAT, Dims: 3[64, 512, 512]
1: [(Unnamed Layer* 1) [Scale]], type: kSCALE, precision: kFLOAT, inputs: 1, outputs: 1
Mode: 0(0, 1, 2)
Shifts: kFLOAT, count: 0
Scales: kFLOAT, count: 1
Powers: kFLOAT, count: 0
Input tensor: (Unnamed Layer* 0) [Convolution]_output, kFLOAT, Dims: 3[64, 512, 512]
Output tensor: (Unnamed Layer* 1) [Scale]_output, kFLOAT, Dims: 3[64, 512, 512]
2: [(Unnamed Layer* 2) [Activation]], type: kACTIVATION, precision: kFLOAT, inputs: 1, outputs: 1
Type: 0 (Relu)
Alpha: 1.123, Beta: 4.213
Input tensor: (Unnamed Layer* 1) [Scale]_output, kFLOAT, Dims: 3[64, 512, 512]
Output tensor: (Unnamed Layer* 2) [Activation]_output, kFLOAT, Dims: 3[64, 512, 512]
3: [(Unnamed Layer* 3) [Pooling]], type: kPOOLING, precision: kFLOAT, inputs: 1, outputs: 1
Pooling:
Type : kMAX
Padding Mode: kEXPLICIT_ROUND_DOWN
Window : 3, 3
Stride : 2, 2
Padding : 1, 1
Pre Padding : 2: 1, 1,
Post Padding: 2, 1, 1,
getBlendFactor: 0
getAverageCountExcludesPadding: 1
Input tensor: (Unnamed Layer* 2) [Activation]_output, kFLOAT, Dims: 3[64, 512, 512]
Output tensor: (Unnamed Layer* 3) [Pooling]_output, kFLOAT, Dims: 3[64, 256, 256]
4: [(Unnamed Layer* 4) [Convolution]], type: kCONVOLUTION, precision: kHALF, inputs: 1, outputs: 1
Convolution: Dims: 3 x 3, getNbOutputMaps: 2, Stride: 2x2, Padding: 1x1, Dilation: 1x1
Input tensor: (Unnamed Layer* 3) [Pooling]_output, kFLOAT, Dims: 3[64, 256, 256]
Output tensor: output, kFLOAT, Dims: 3[2, 128, 128]

[RVLayerProfiler - …]: Layer [1]: [input to nvm]: 6.11952ms
[RVLayerProfiler - …]: Layer [2]: [{(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Pooling],(Unnamed Layer* 4) [Convolution]}]: 0.005952ms
[RVLayerProfiler - …]: Layer [3]: [input copy finish]: 0ms
[RVLayerProfiler - …]: Layer [4]: [output from nvm]: 13.2621ms
[RVLayerProfiler - …]: Layer [5]: [output copy finish]: 0.003808ms
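
For context, per-layer output like the above typically comes from a custom nvinfer1::IProfiler attached to the execution context. The actual RVLayerProfiler is not part of this post, so the sketch below is only illustrative of how such timings are collected (the exact reportLayerTime signature varies slightly between TensorRT versions):

#include <NvInfer.h>
#include <cstdio>

// Minimal per-layer profiler sketch; the class name and output format are illustrative only.
class SimpleLayerProfiler : public nvinfer1::IProfiler
{
public:
    // TensorRT calls this once per profiled layer with the measured time in milliseconds.
    void reportLayerTime(const char* layerName, float ms) noexcept override
    {
        printf("Layer [%d]: [%s]: %fms\n", ++mIndex, layerName, ms);
    }
private:
    int mIndex{0};
};

// Usage, assuming an existing IExecutionContext* context:
//   SimpleLayerProfiler profiler;
//   context->setProfiler(&profiler);
//   context->execute(batchSize, bindings);  // per-layer times are reported through the profiler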

I can send a test code if relevant.

Thanks
Eyal

Hi,

Based on your log, the bottleneck is from memory copy rather than inference.

[input to nvm]: 6.11952ms
...
[output from nvm]: 13.2621ms

Please note that the CPU and GPU share the same physical memory on the Jetson platform,
so you don’t need to actually copy the data; you only need to allow access for the heterogeneous processors.
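
For reference, here is a minimal sketch of the mapped (“zero-copy”) allocation path on Jetson; buffer sizes and names are illustrative and not taken from DLATest_1.cu. The device alias returned by cudaHostGetDevicePointer can be passed as the TensorRT binding pointer instead of cudaMemcpy’ing into a separate device buffer:

#include <cuda_runtime.h>

// Allocate host memory that the GPU can access directly (shared physical memory on Jetson).
void* allocZeroCopy(size_t bytes, void** devicePtr)
{
    void* hostPtr = nullptr;
    cudaSetDeviceFlags(cudaDeviceMapHost);                // must be called before the CUDA context is created
    cudaHostAlloc(&hostPtr, bytes, cudaHostAllocMapped);  // pinned + mapped host allocation
    cudaHostGetDevicePointer(devicePtr, hostPtr, 0);      // GPU-visible alias of the same physical pages
    return hostPtr;
}

// Alternatively, cudaMallocManaged(&ptr, bytes) returns a single pointer usable
// from both CPU and GPU; on Jetson neither path performs a physical copy.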

There are several different kinds of memory that can be used for ‘zero copy’.
Please check this page for more information:

Thanks.

Thanks, I’ll give it a try.
However, the reason I asked is that the profiler output for the DLA doesn’t make much sense to me.

This is what the DLA says for this small test network:

[RVLayerProfiler - …]: Layer [46]: [input to nvm]: 0.211264ms
[RVLayerProfiler - …]: Layer [47]: [{(Unnamed Layer* 0) [Convolution],(Unnamed Layer* 1) [Scale],(Unnamed Layer* 2) [Activation],(Unnamed Layer* 3) [Pooling],(Unnamed Layer* 4) [Convolution]}]: 0.347616ms
[RVLayerProfiler - …]: Layer [48]: [input copy finish]: 0.0952ms
[RVLayerProfiler - …]: Layer [49]: [output from nvm]: 7.02112ms
[RVLayerProfiler - …]: Layer [50]: [output copy finish]: 0.004096ms
Total host: [82.2292 ms]
Average time: [8.22293 ms]

And this is for the GPU, obtained by just changing the device type via builder->setDefaultDeviceType (see the sketch after this output):

[RVLayerProfiler - …]: Layer [1]: [(Unnamed Layer* 0) [Convolution] + (Unnamed Layer* 2) [Activation] input reformatter 0]: 0.129888ms
[RVLayerProfiler - …]: Layer [2]: [(Unnamed Layer* 0) [Convolution] + (Unnamed Layer* 2) [Activation]]: 1.23267ms
[RVLayerProfiler - …]: Layer [3]: [(Unnamed Layer* 3) [Pooling]]: 0.41904ms
[RVLayerProfiler - …]: Layer [4]: [(Unnamed Layer* 4) [Convolution]]: 0.196832ms
[RVLayerProfiler - …]: Layer [5]: [(Unnamed Layer* 4) [Convolution] output reformatter 0]: 0.006944ms
Total host: [24.2485 ms]
Average time: [2.42485 ms]
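
For reference, a hedged sketch of how the device type can be switched at build time. On the TensorRT version used here (JetPack with CUDA 10.0) these setters live on IBuilder; on newer releases the equivalents are on IBuilderConfig (setDefaultDeviceType, setDLACore, and BuilderFlag::kGPU_FALLBACK). The function name below is illustrative:

#include <NvInfer.h>

// Select GPU or DLA as the default device for all layers before building the engine.
void configureDevice(nvinfer1::IBuilder* builder, bool useDLA)
{
    if (useDLA)
    {
        builder->setFp16Mode(true);                                  // DLA requires FP16 (or INT8) precision
        builder->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
        builder->setDLACore(0);
        builder->allowGPUFallback(true);                             // let unsupported layers fall back to the GPU
    }
    else
    {
        builder->setDefaultDeviceType(nvinfer1::DeviceType::kGPU);
    }
}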

So the whole network calculation on the DLA takes 0.347ms, whereas the calculation of the same input/network on the GPU takes 1.8485ms (1.23267ms + 0.41904ms + 0.196832ms)?

Is this possible? I thought maybe the profiler, when used for DLA ops, yields wrong results and sends me on a wild goose chase…

Thanks
Eyal

Hi,
Attached is a repro for this. Managed memory did not make a difference as far as I can tell.
./a.out 0 100 -> For GPU: Average time: [2.50781 ms]
./a.out 1 100 -> For DLA: Average time: [8.23565 ms]

Compile with:
/usr/local/cuda-10.0/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -I"…/common" -I"/usr/local/cuda/include" -I"/usr/local/cuda/include" -I/usr/src/tensorrt/samples/common/ -o DLATest_1.o -c DLATest_1.cu
/usr/local/cuda-10.0/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -L"/usr/local/cuda/lib64" -L"/usr/local/cuda/lib64" -lnvinfer -lnvparsers -lnvinfer_plugin -lnvonnxparser -lcudnn -lcublas -lcudart -lrt -ldl -lpthread -lopencv_highgui -lopencv_core -o a.out DLATest_1.o
DLATest_1_cu.txt (10.9 KB)

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Hi,

We tried to reproduce this issue in our environment but hit a segmentation fault error.
Is the GeneralUtils.h file used in DLATest the same as the utils.h of CUDA cores vs Tensor Cores?

Thanks.