Actually, there is a bug with the DLA: you can't run two DLAs in the same process. You can program both, but they will not work at the same time.
This is a known issue and is expected to be fixed in a future release.
Thanks for the answer. I removed one of the DLAs, but the GPU and DLA0 still seem to be running serially.
Any other suggestions?
CPU Total time : [470.815 ms]
CPU timer average : [4.70815 ms]
[DLA 1]: Total time for 0 layers: 0ms
[DLA 1]: Shortest layer:  ran for [3.40282e+38ms]
[DLA 1]: Longest layer:  ran for [0ms]
[DLA 1]: Average time : [0ms]
[DLA 0]: Total time for 500 layers: 182.218ms
[DLA 0]: Shortest layer: [output copy finish] ran for [0.00352ms]
[DLA 0]: Longest layer: [output from nvm] ran for [1.99443ms]
[DLA 0]: Average time : [0.363708ms]
[GPU]: Total time for 100 layers: 264.819ms
[GPU]: Shortest layer: [(Unnamed Layer* 0) [Convolution]] ran for [2.5999ms]
[GPU]: Longest layer: [(Unnamed Layer* 0) [Convolution]] ran for [4.1215ms]
[GPU]: Average time : [2.62197ms]
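(In case it helps anyone reproducing this: per-layer timings like the dump above are typically collected by attaching an IProfiler to the execution context. A minimal sketch; the class name PerLayerProfiler is just illustrative:)

```cpp
#include <cstdio>
#include <map>
#include <string>
#include "NvInfer.h"

// Accumulates per-layer times reported by TensorRT. The noexcept on the
// override is required by TensorRT 8+ and harmless on earlier releases.
class PerLayerProfiler : public nvinfer1::IProfiler
{
public:
    void reportLayerTime(const char* layerName, float ms) noexcept override
    {
        mTotalMs[layerName] += ms;
        ++mCount[layerName];
    }

    void print() const
    {
        for (const auto& kv : mTotalMs)
            std::printf("%-40s avg %8.4f ms\n", kv.first.c_str(),
                        kv.second / mCount.at(kv.first));
    }

private:
    std::map<std::string, float> mTotalMs;
    std::map<std::string, int> mCount;
};

// Usage (note that profiling forces synchronous execution):
//   PerLayerProfiler profiler;
//   context->setProfiler(&profiler);
//   context->execute(batchSize, bindings);
//   profiler.print();
```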
Since the DLA is a hardware-based engine, not all TensorRT operations can be deployed on it.
For those unsupported operations, TensorRT falls back to the GPU.
As a result, the GPU may run slower, since part of its resources is occupied by the fallback layers.
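For reference, this fallback is opt-in at build time. A minimal sketch, assuming the TensorRT 7+ IBuilderConfig API (earlier releases exposed the same options on IBuilder, e.g. allowGPUFallback):

```cpp
#include "NvInfer.h"

// Build-time DLA placement (sketch; builder/network/config are assumed to
// have been created as usual via createInferBuilder etc.).
void configureForDla(nvinfer1::IBuilderConfig* config)
{
    // Prefer the DLA for every layer...
    config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    config->setDLACore(0);  // target one of the two DLA engines

    // ...but allow TensorRT to move unsupported layers back to the GPU.
    // Without this flag, the build fails if any layer cannot run on DLA.
    config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);

    // DLA requires a reduced-precision build (FP16 or INT8).
    config->setFlag(nvinfer1::BuilderFlag::kFP16);
}
```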
"NVIDIA's DLA promises performance of up to 11.4 INT8 TOPS and is also capable of FP16 operation at half speed, at 5.7 TOPS."
According to AnandTech, then, I should be able to get ~11 FP16 TFLOPS from the Tensor Cores while getting another 5.7 from each of the DLAs, and still have some headroom in the regular CUDA cores?
I understand it is highly unlikely to achieve such high performance, but generally speaking, I should be able to get close to, say, 11 + 5.7 + 5.7 FP16 TFLOPS on the Xavier?
Can you please specify the breakdown of TFLOPS between the CUDA cores and the Tensor Cores?
Knowing that, I could estimate how many more FLOPS I can get from the hardware, given that my code currently runs only on the CUDA cores, right?
As for the DLA, you wrote "Each DLA can reach TFLOPS". Did you mean each can reach 1 TFLOPS, or something else?
Based on the image:
After the inference (ExecutionContext::enqueue), there are cudaEventRecord and cudaEventSynchronize calls.
I would like to ask what the total execution time of inference covers. Does it include the time of cudaEventRecord and cudaEventSynchronize, or only the time of ExecutionContext::enqueue?
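To make the question concrete, here is the usual pattern (a sketch; the context/bindings/stream setup is assumed):

```cpp
#include <cuda_runtime.h>
#include "NvInfer.h"

// Returns the GPU time (ms) spent on the enqueued inference work.
// context, bindings, and stream are assumed to be set up already.
float timeInference(nvinfer1::IExecutionContext& context, void** bindings,
                    cudaStream_t stream, int batchSize)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);   // timestamp placed in the stream
    context.enqueue(batchSize, bindings, stream, nullptr);  // asynchronous
    cudaEventRecord(stop, stream);    // timestamp after the enqueued work
    cudaEventSynchronize(stop);       // host blocks until the GPU is done

    float ms = 0.f;
    // Elapsed GPU time between the two events: the inference itself, not
    // the host-side cost of cudaEventRecord/cudaEventSynchronize.
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```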
I've run into some issues in my project when running the DLA and GPU at the same time, so I went back to your sample.
As far as I can tell, I see the same behavior in your test code.
I ran the test under nvprof as follows: ./test -1 → result shown in gpu_only.jpg.
Then ran ./test -1 0 → result shown in gpu_dla.jpg.
I instrumented the code with an empty CUDA kernel, called just before and after the enqueue call, so that I know exactly when the GPU/DLA work starts and ends.
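Roughly like this (a sketch; the marker name and surrounding setup are illustrative):

```cpp
#include <cuda_runtime.h>
#include "NvInfer.h"

// Empty kernel: does no work, but shows up on the nvprof timeline,
// bracketing the region of interest.
__global__ void marker() {}

void runBracketed(nvinfer1::IExecutionContext* context, void** bindings,
                  cudaStream_t stream, int batchSize)
{
    marker<<<1, 1, 0, stream>>>();  // visible "start" tick in the profile
    context->enqueue(batchSize, bindings, stream, nullptr);
    marker<<<1, 1, 0, stream>>>();  // visible "end" tick in the profile
    cudaStreamSynchronize(stream);  // wait so the whole span is captured
}
```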
As can be seen in the attached images, when I run the GPU alone it takes ~230-250 ms; when the GPU and DLA run together it takes ~500 ms.
gpu_dla.jpg shows that the DLA and GPU run concurrently (at least to some degree), but they seem to block/interfere with each other. This is also what I see in my own test code.
I understand this might be due to a lack of resources, the network configuration, the network layers, etc., but the end result is that moving even a simple test to the DLA did not improve performance over running everything on the GPU alone.