I am using the Jetson AGX Xavier with the latest JetPack 4.1.1 (TensorRT 5.0).
Why did NVIDIA add two DLAs to the Xavier instead of simply increasing the number of CUDA cores and Tensor Cores?
When I ran trtexec with ResNet-50 in MAXN mode, I found that the GPU is faster than the DLA.
The output of running on 1 DLA:
avgRuns: 1000
deploy: /home/nvidia/Networks/ResNet-50/deploy.prototxt
batch: 1
iterations: 5
output: prob
useDLACore: 0
Input "data": 3x224x224
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 1000 runs is 7.63907 ms (host walltime is 7.72017 ms, 99% percentile time is 7.86941).
The output of running on the GPU:
avgRuns: 1000
deploy: /home/nvidia/Networks/ResNet-50/deploy.prototxt
batch: 1
iterations: 5
output: prob
Input "data": 3x224x224
Output "prob": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 1000 runs is 3.49843 ms (host walltime is 3.54138 ms, 99% percentile time is 5.46234).
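To put a number on the gap, here is a small sketch that takes the two average latencies reported above and works out the ratio and the implied throughput. The variable and function names are my own, and the throughput figure assumes batch 1 with inferences run back to back, so treat it as a rough estimate rather than a measured value.

```python
# Average latencies in milliseconds, copied from the two trtexec runs above.
DLA_MS = 7.63907
GPU_MS = 3.49843


def images_per_second(latency_ms, batch=1):
    """Throughput implied by a per-batch latency (assumes back-to-back runs)."""
    return batch * 1000.0 / latency_ms


# How many times longer one inference takes on the DLA than on the GPU.
slowdown = DLA_MS / GPU_MS

print(f"DLA latency / GPU latency: {slowdown:.2f}x")
print(f"DLA: {images_per_second(DLA_MS):.1f} images/s")
print(f"GPU: {images_per_second(GPU_MS):.1f} images/s")
```

In this test the DLA takes roughly 2.2 times as long per image as the GPU.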
So I do not really understand: what is the advantage of using the DLA over the GPU?