DLA purpose

Hi,

I am using the Jetson AGX Xavier with the latest JetPack 4.1.1 (TensorRT 5.0).

Why did NVIDIA add two DLAs to the Xavier instead of just increasing the number of CUDA cores and Tensor Cores?

When I ran trtexec with ResNet-50 in MAXN mode, I found that the GPU is faster than the DLA.

The output of running on 1 DLA:

avgRuns: 1000
deploy: /home/nvidia/Networks/ResNet-50/deploy.prototxt
fp16
batch: 1
iterations: 5
output: prob
useSpinWait
useDLACore: 0
allowGPUFallback
Input "data": 3x224x224
Output "prob": 1000x1x1

Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 1000 runs is 7.63907 ms (host walltime is 7.72017 ms, 99% percentile time is 7.86941).

The output of running on the GPU:

avgRuns: 1000
deploy: /home/nvidia/Networks/ResNet-50/deploy.prototxt
fp16
batch: 1
iterations: 5
output: prob
useSpinWait
Input "data": 3x224x224
Output "prob": 1000x1x1

name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 1000 runs is 3.49843 ms (host walltime is 3.54138 ms, 99% percentile time is 5.46234).

So I do not really understand what the advantage of using the DLA over the GPU is.

Thanks,
Bental

Hi,

You can find an introduction to Xavier here:
https://devblogs.nvidia.com/nvidia-jetson-agx-xavier-32-teraops-ai-robotics/

The NVIDIA DLA is designed specifically for deep learning and is used to offload inference work from the GPU.
These engines improve energy efficiency and free up the GPU to run more complex networks and dynamic tasks implemented by the user.
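
To put rough numbers on the offload idea, here is a small back-of-the-envelope sketch in Python that uses only the two average latencies from the trtexec logs above. It assumes the GPU and both DLA cores can run batch-1 ResNet-50 concurrently with no contention for memory bandwidth, and it ignores the "prob" layer that fell back to the GPU in the DLA run. Neither assumption was measured in this thread, so read the combined figure as an optimistic upper bound, not a benchmark result.

```python
# Average batch-1 latencies reported by trtexec in this thread (milliseconds).
GPU_LATENCY_MS = 3.49843
DLA_LATENCY_MS = 7.63907
NUM_DLA_CORES = 2  # the Xavier has two DLA engines


def images_per_second(latency_ms: float) -> float:
    """Convert a per-image latency in milliseconds to images per second."""
    return 1000.0 / latency_ms


gpu_ips = images_per_second(GPU_LATENCY_MS)
dla_ips = images_per_second(DLA_LATENCY_MS)

# How much slower one DLA core is than the GPU for this network.
dla_slowdown = DLA_LATENCY_MS / GPU_LATENCY_MS

# ASSUMPTION: perfect concurrency, no shared-memory contention, no GPU fallback.
combined_ips = gpu_ips + NUM_DLA_CORES * dla_ips

print(f"GPU only:          {gpu_ips:.1f} images/s")
print(f"One DLA core:      {dla_ips:.1f} images/s ({dla_slowdown:.2f}x slower than GPU)")
print(f"GPU + 2 DLA cores: {combined_ips:.1f} images/s (idealized upper bound)")
```

So a single DLA core is slower than the GPU on this network, but work placed on the DLAs does not occupy the GPU, which stays available for other networks or CUDA tasks.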

Thanks.