Unable to use DLA with TensorRT

Hi,

I am trying to run my GoogLeNet model with the TensorRT tool trtexec to benchmark the DLA, but it fails.

Here is the output without the DLA option (this works):

./trtexec --deploy=deploy.prototxt --output="score_fr" --fp16 --allowGPUFallback
deploy: deploy.prototxt
output: score_fr
fp16
allowGPUFallback
Input "data": 3x1000x1000
Output "score_fr": 3x62x62
name=data, bindingIndex=0, buffers.size()=2
name=score_fr, bindingIndex=1, buffers.size()=2
Average over 10 runs is 13.0648 ms (host walltime is 13.1717 ms, 99% percentile time is 13.6999).
Average over 10 runs is 12.8313 ms (host walltime is 13.1476 ms, 99% percentile time is 13.6221).
Average over 10 runs is 13.0187 ms (host walltime is 13.1206 ms, 99% percentile time is 13.5664).
Average over 10 runs is 12.8759 ms (host walltime is 13.0645 ms, 99% percentile time is 13.5524).
Average over 10 runs is 12.7174 ms (host walltime is 12.9099 ms, 99% percentile time is 13.6068).
Average over 10 runs is 12.9357 ms (host walltime is 13.0993 ms, 99% percentile time is 13.5779).
Average over 10 runs is 12.809 ms (host walltime is 13.0691 ms, 99% percentile time is 13.5851).
Average over 10 runs is 12.8987 ms (host walltime is 13.0356 ms, 99% percentile time is 13.6151).
Average over 10 runs is 12.9035 ms (host walltime is 13.1001 ms, 99% percentile time is 13.6209).
Average over 10 runs is 12.8646 ms (host walltime is 13.073 ms, 99% percentile time is 13.6201).

And now with the DLA option:

./trtexec --deploy=deploy.prototxt --output="score_fr" --fp16 --allowGPUFallback --useDLA=1
deploy: deploy.prototxt
output: score_fr
fp16
allowGPUFallback
useDLA: 1
Input "data": 3x1000x1000
Output "score_fr": 3x62x62
../builder/cudnnBuilder2.cpp (689) - Misc Error in buildSingleLayer: 1 (Unable to process layer.)
../builder/cudnnBuilder2.cpp (689) - Misc Error in buildSingleLayer: 1 (Unable to process layer.)
could not build engine
Engine could not be created
Engine could not be created

Also, the speed gain compared to the TX2 is quite limited: about 3x on average. Here is the same model on the TX2:

./giexec --deploy=deploy.prototxt --output="score_fr" --half2
deploy: deploy.prototxt
output: score_fr
half2
Input "data": 3x1000x1000
Output "score_fr": 3x62x62
name=data, bindingIndex=0, buffers.size()=2
name=score_fr, bindingIndex=1, buffers.size()=2
Average over 10 runs is 38.887 ms.
Average over 10 runs is 30.9248 ms.
Average over 10 runs is 30.9282 ms.
Average over 10 runs is 31.0399 ms.
Average over 10 runs is 31.0709 ms.
Average over 10 runs is 30.9333 ms.
Average over 10 runs is 30.9288 ms.
Average over 10 runs is 31.0606 ms.
Average over 10 runs is 31.0512 ms.
Average over 10 runs is 30.9509 ms.
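
As a rough check of that speed-gain figure, the per-batch averages from the Xavier fp16 log and the TX2 half2 log above can be compared in a few lines of Python. This is only a sketch: the variable and function names are made up for illustration, and treating the first TX2 batch as warm-up is an assumption. On these numbers the fp16 gain comes out at roughly 2.4x to 2.5x.

```python
from statistics import mean

# Per-batch averages (ms) copied from the two logs above.
xavier_fp16_ms = [13.0648, 12.8313, 13.0187, 12.8759, 12.7174,
                  12.9357, 12.809, 12.8987, 12.9035, 12.8646]
tx2_half2_ms = [38.887, 30.9248, 30.9282, 31.0399, 31.0709,
                30.9333, 30.9288, 31.0606, 31.0512, 30.9509]


def speedup(baseline_ms, candidate_ms):
    """Ratio of mean baseline latency to mean candidate latency."""
    return mean(baseline_ms) / mean(candidate_ms)


# Assumption: the first TX2 batch (38.887 ms) is warm-up, so compare
# both with and without it.
with_warmup = speedup(tx2_half2_ms, xavier_fp16_ms)
without_warmup = speedup(tx2_half2_ms[1:], xavier_fp16_ms)

print(f"Xavier fp16 vs TX2 half2: {with_warmup:.2f}x (all batches), "
      f"{without_warmup:.2f}x (first TX2 batch dropped)")
```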

Are there still open issues with the DLA? And how can I enable the Tensor Cores?

Best Regards.

Hi Austriker, are you able to run trtexec with sudo? There is a release note included with JetPack 4.1 EA about needing to run trtexec with sudo or executing trtexec from a working directory where the user has write access.
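
A quick way to check the write-access condition before reaching for sudo is sketched below in Python. The function name `can_write_here` is hypothetical, and the check only covers directory permissions, which may not be everything trtexec needs.

```python
import os


def can_write_here(directory="."):
    """Return True if the current user may create files in `directory`."""
    # W_OK: may write entries; X_OK: may traverse into the directory.
    return os.access(directory, os.W_OK | os.X_OK)


if __name__ == "__main__":
    cwd = os.getcwd()
    if can_write_here(cwd):
        print(f"{cwd} is writable by the current user")
    else:
        print(f"{cwd} is not writable; use sudo or change directory")
```

Note that when this is run as root, the check will normally succeed for any existing directory.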

Are you able to run on the GPU in INT8 mode? Also, which nvpmodel power mode are you using on Xavier?

Hi Dusty,

I get the same issue with sudo:

/usr/src/tensorrt/bin$ sudo ./trtexec --deploy=deploy.prototxt --output="score_fr" --fp16 --allowGPUFallback --useDLA=1
deploy: deploy.prototxt
output: score_fr
fp16
allowGPUFallback
useDLA: 1
Input "data": 3x1000x1000
Output "score_fr": 3x62x62
../builder/cudnnBuilder2.cpp (689) - Misc Error in buildSingleLayer: 1 (Unable to process layer.)
../builder/cudnnBuilder2.cpp (689) - Misc Error in buildSingleLayer: 1 (Unable to process layer.)
could not build engine
Engine could not be created
Engine could not be created

This is how I set the power mode and clocks for testing:

$ sudo nvpmodel -q
NV Power Mode: MAXN
0
$ sudo ./jetson_clock.sh

And it works on the GPU in int8:

/usr/src/tensorrt/bin$ ./trtexec --deploy=deploy.prototxt --output="score_fr" --int8
deploy: deploy.prototxt
output: score_fr
int8
Input "data": 3x1000x1000
Output "score_fr": 3x62x62
name=data, bindingIndex=0, buffers.size()=2
name=score_fr, bindingIndex=1, buffers.size()=2
Average over 10 runs is 6.62927 ms (host walltime is 6.73198 ms, 99% percentile time is 6.74451).
Average over 10 runs is 6.60573 ms (host walltime is 6.69867 ms, 99% percentile time is 6.70912).
Average over 10 runs is 6.60916 ms (host walltime is 6.69708 ms, 99% percentile time is 6.69008).
Average over 10 runs is 6.59638 ms (host walltime is 6.68146 ms, 99% percentile time is 6.62266).
Average over 10 runs is 6.59809 ms (host walltime is 6.68097 ms, 99% percentile time is 6.61866).
Average over 10 runs is 6.60055 ms (host walltime is 6.68902 ms, 99% percentile time is 6.64214).
Average over 10 runs is 6.59413 ms (host walltime is 6.67495 ms, 99% percentile time is 6.60608).
Average over 10 runs is 6.60316 ms (host walltime is 6.69265 ms, 99% percentile time is 6.65357).
Average over 10 runs is 6.58977 ms (host walltime is 6.66994 ms, 99% percentile time is 6.61312).
Average over 10 runs is 6.59685 ms (host walltime is 6.66984 ms, 99% percentile time is 6.6088).
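
To put the fp16 and int8 GPU results side by side, the latencies can be converted to throughput. The sketch below assumes each reported time corresponds to a single batch-1 inference (the batch size is not shown in the logs), and the function name `images_per_second` is made up for illustration.

```python
from statistics import mean

# Per-batch averages (ms) copied from the Xavier GPU logs in this thread.
fp16_ms = [13.0648, 12.8313, 13.0187, 12.8759, 12.7174,
           12.9357, 12.809, 12.8987, 12.9035, 12.8646]
int8_ms = [6.62927, 6.60573, 6.60916, 6.59638, 6.59809,
           6.60055, 6.59413, 6.60316, 6.58977, 6.59685]


def images_per_second(latencies_ms, batch_size=1):
    """Throughput implied by the mean latency (batch_size is an assumption)."""
    return batch_size * 1000.0 / mean(latencies_ms)


fp16_fps = images_per_second(fp16_ms)
int8_fps = images_per_second(int8_ms)

print(f"fp16: {fp16_fps:.1f} img/s, int8: {int8_fps:.1f} img/s, "
      f"int8 gain: {int8_fps / fp16_fps:.2f}x")
```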

Best regards

I’m not familiar with a GoogLeNet model that has a layer named “score_fr”. Can you first try the prototxt with the output “prob” and the caffemodel from here?

https://github.com/BVLC/caffe/blob/master/models/bvlc_googlenet/deploy.prototxt
http://dl.caffe.berkeleyvision.org/bvlc_googlenet.caffemodel

It’s a custom model.

Here are the results with the standard model:

sudo ./trtexec --deploy=deploy_gn.prototxt --output="prob" --fp16 --allowGPUFallback --useDLA=1
deploy: deploy_gn.prototxt
output: prob
fp16
allowGPUFallback
useDLA: 1
Input "data": 3x224x224
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 7.64629 ms (host walltime is 7.77383 ms, 99% percentile time is 8.30611).
Average over 10 runs is 7.5626 ms (host walltime is 7.65182 ms, 99% percentile time is 7.70368).
Average over 10 runs is 7.493 ms (host walltime is 7.58619 ms, 99% percentile time is 7.59245).
Average over 10 runs is 7.44174 ms (host walltime is 7.52865 ms, 99% percentile time is 7.50813).
Average over 10 runs is 7.46562 ms (host walltime is 7.56241 ms, 99% percentile time is 7.63514).
Average over 10 runs is 7.45979 ms (host walltime is 7.55402 ms, 99% percentile time is 7.48954).
Average over 10 runs is 7.44509 ms (host walltime is 7.53902 ms, 99% percentile time is 7.53344).
Average over 10 runs is 7.43022 ms (host walltime is 7.53287 ms, 99% percentile time is 7.4872).
Average over 10 runs is 7.42556 ms (host walltime is 7.51456 ms, 99% percentile time is 7.49254).
Average over 10 runs is 7.42478 ms (host walltime is 7.51752 ms, 99% percentile time is 7.45405).

sudo ./trtexec --deploy=deploy_gn.prototxt --output="prob" --fp16 --allowGPUFallback --useDLA=1 --model=bvlc_googlenet.caffemodel
deploy: deploy_gn.prototxt
output: prob
fp16
allowGPUFallback
useDLA: 1
model: bvlc_googlenet.caffemodel
Input "data": 3x224x224
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 7.63068 ms (host walltime is 7.73798 ms, 99% percentile time is 8.35382).
Average over 10 runs is 7.4754 ms (host walltime is 7.56465 ms, 99% percentile time is 7.56941).
Average over 10 runs is 7.4509 ms (host walltime is 7.54227 ms, 99% percentile time is 7.50518).
Average over 10 runs is 7.4246 ms (host walltime is 7.51146 ms, 99% percentile time is 7.45379).
Average over 10 runs is 7.47307 ms (host walltime is 7.56497 ms, 99% percentile time is 7.66563).
Average over 10 runs is 7.41936 ms (host walltime is 7.51703 ms, 99% percentile time is 7.50883).
Average over 10 runs is 7.42538 ms (host walltime is 7.51459 ms, 99% percentile time is 7.52621).
Average over 10 runs is 7.42108 ms (host walltime is 7.51216 ms, 99% percentile time is 7.45984).
Average over 10 runs is 7.39882 ms (host walltime is 7.49564 ms, 99% percentile time is 7.46029).
Average over 10 runs is 7.40624 ms (host walltime is 7.48863 ms, 99% percentile time is 7.44243).

sudo ./trtexec --deploy=deploy_gn.prototxt --output="prob" --fp16 --useDLA=1 --model=bvlc_googlenet.caffemodel
deploy: deploy_gn.prototxt
output: prob
fp16
useDLA: 1
model: bvlc_googlenet.caffemodel
Input "data": 3x224x224
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA and falling back to GPU is not enabled.
could not build engine
Engine could not be created
Engine could not be created

OK, it looks like you were able to run the standard Googlenet on DLA. Note that in the JetPack Developer Preview Early Access, the networks officially tested on DLA include Alexnet, Googlenet, Resnet, and VGG. The customized layers in your model may be what is causing it to not run on DLA.

OK, so I tested a semantic segmentation network, following the jetson-inference tutorial: https://raw.githubusercontent.com/NVIDIA/DIGITS/master/examples/semantic-segmentation/fcn_alexnet.prototxt

But I still can’t manage to launch it.

sudo ./trtexec --deploy=FCN-Alexnet-Pascal-VOC/fcn_alexnet.deploy.prototxt --output="score_fr_21classes" --fp16 --useDLA=1
deploy: FCN-Alexnet-Pascal-VOC/fcn_alexnet.deploy.prototxt
output: score_fr_21classes
fp16
useDLA: 1
Input "data": 3x500x356
Output "score_fr_21classes": 21x16x12
Default DLA is enabled but layer conv1 is not running on DLA and falling back to GPU is not enabled.
could not build engine
Engine could not be created
Engine could not be created

It is saying that there is a layer that can’t run on DLA, and since GPU fallback isn’t enabled, it couldn’t build the TRT engine.
You could try launching it with --allowGPUFallback.
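
To see at a glance which layers DLA declined in a saved trtexec log, something like the following Python sketch may help. The pattern is based only on the two message variants that appear in this thread, and the names `DLA_MSG` and `dla_rejections` are made up for illustration; other TensorRT versions may word the messages differently.

```python
import re

# Matches both variants seen in this thread:
#   "... is not running on DLA, falling back to GPU."
#   "... is not running on DLA and falling back to GPU is not enabled."
DLA_MSG = re.compile(
    r"Default DLA is enabled but layer (?P<layer>\S+) is not running on DLA"
    r"(?P<rest>.*)"
)


def dla_rejections(log_text):
    """Return a list of (layer_name, fell_back_to_gpu) tuples."""
    results = []
    for line in log_text.splitlines():
        match = DLA_MSG.search(line)
        if match:
            fell_back = "is not enabled" not in match.group("rest")
            results.append((match.group("layer"), fell_back))
    return results


log = """\
Input "data": 3x500x356
Output "score_fr_21classes": 21x16x12
Default DLA is enabled but layer conv1 is not running on DLA and falling back to GPU is not enabled.
could not build engine
"""

for layer, fell_back in dla_rejections(log):
    action = "assigned to GPU" if fell_back else "no GPU fallback, build fails"
    print(f"{layer}: {action}")
```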

I tried that as well:

sudo ./trtexec --deploy=FCN-Alexnet-Pascal-VOC/fcn_alexnet.deploy.prototxt --output="score_fr_21classes" --fp16 --useDLA=1 --allowGPUFallback
[sudo] password for nvidia: 
deploy: FCN-Alexnet-Pascal-VOC/fcn_alexnet.deploy.prototxt
output: score_fr_21classes
fp16
useDLA: 1
allowGPUFallback
Input "data": 3x500x356
Output "score_fr_21classes": 21x16x12
Default DLA is enabled but layer conv1 is not running on DLA, falling back to GPU.
../builder/cudnnBuilder2.cpp (689) - Misc Error in buildSingleLayer: 1 (Unable to process layer.)
../builder/cudnnBuilder2.cpp (689) - Misc Error in buildSingleLayer: 1 (Unable to process layer.)
could not build engine
Engine could not be created
Engine could not be created

It works with GoogLeNet:

sudo ./trtexec --deploy=deploy_gn.prototxt --output="prob" --fp16 --useDLA=1 --allowGPUFallback
deploy: deploy_gn.prototxt
output: prob
fp16
useDLA: 1
allowGPUFallback
Input "data": 3x224x224
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 7.61525 ms (host walltime is 7.71438 ms, 99% percentile time is 8.42531).
Average over 10 runs is 7.45112 ms (host walltime is 7.5426 ms, 99% percentile time is 7.54054).
Average over 10 runs is 7.47984 ms (host walltime is 7.56824 ms, 99% percentile time is 7.56355).
Average over 10 runs is 7.46052 ms (host walltime is 7.54673 ms, 99% percentile time is 7.6247).
Average over 10 runs is 7.43091 ms (host walltime is 7.5198 ms, 99% percentile time is 7.4839).
Average over 10 runs is 7.42442 ms (host walltime is 7.51952 ms, 99% percentile time is 7.56941).
Average over 10 runs is 7.41408 ms (host walltime is 7.50007 ms, 99% percentile time is 7.4983).
Average over 10 runs is 7.42892 ms (host walltime is 7.52415 ms, 99% percentile time is 7.60461).
Average over 10 runs is 7.41212 ms (host walltime is 7.49461 ms, 99% percentile time is 7.44051).
Average over 10 runs is 7.46012 ms (host walltime is 7.54797 ms, 99% percentile time is 7.55254).

But it’s about 3 times slower with the DLA than on the GPU alone:

$ sudo ./trtexec --deploy=deploy_gn.prototxt --output="prob" --fp16 --useDLA=1 --allowGPUFallback
deploy: deploy_gn.prototxt
output: prob
fp16
useDLA: 1
allowGPUFallback
Input "data": 3x224x224
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 7.61525 ms (host walltime is 7.71438 ms, 99% percentile time is 8.42531).
Average over 10 runs is 7.45112 ms (host walltime is 7.5426 ms, 99% percentile time is 7.54054).
Average over 10 runs is 7.47984 ms (host walltime is 7.56824 ms, 99% percentile time is 7.56355).
Average over 10 runs is 7.46052 ms (host walltime is 7.54673 ms, 99% percentile time is 7.6247).
Average over 10 runs is 7.43091 ms (host walltime is 7.5198 ms, 99% percentile time is 7.4839).
Average over 10 runs is 7.42442 ms (host walltime is 7.51952 ms, 99% percentile time is 7.56941).
Average over 10 runs is 7.41408 ms (host walltime is 7.50007 ms, 99% percentile time is 7.4983).
Average over 10 runs is 7.42892 ms (host walltime is 7.52415 ms, 99% percentile time is 7.60461).
Average over 10 runs is 7.41212 ms (host walltime is 7.49461 ms, 99% percentile time is 7.44051).
Average over 10 runs is 7.46012 ms (host walltime is 7.54797 ms, 99% percentile time is 7.55254).

$ sudo ./trtexec --deploy=deploy_gn.prototxt --output="prob" --fp16 --useDLA=2 --allowGPUFallback
deploy: deploy_gn.prototxt
output: prob
fp16
useDLA: 2
allowGPUFallback
Input "data": 3x224x224
Output "prob": 1000x1x1
Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 8.35648 ms (host walltime is 8.51915 ms, 99% percentile time is 13.578).
Average over 10 runs is 7.60057 ms (host walltime is 7.71397 ms, 99% percentile time is 7.88573).
Average over 10 runs is 7.45977 ms (host walltime is 7.54657 ms, 99% percentile time is 7.49731).
Average over 10 runs is 7.46231 ms (host walltime is 7.54995 ms, 99% percentile time is 7.6144).
Average over 10 runs is 7.46377 ms (host walltime is 7.55141 ms, 99% percentile time is 7.57261).
Average over 10 runs is 7.45527 ms (host walltime is 7.5493 ms, 99% percentile time is 7.51965).
Average over 10 runs is 7.43578 ms (host walltime is 7.52406 ms, 99% percentile time is 7.4935).
Average over 10 runs is 7.43039 ms (host walltime is 7.51354 ms, 99% percentile time is 7.47037).
Average over 10 runs is 7.42859 ms (host walltime is 7.51587 ms, 99% percentile time is 7.57533).
Average over 10 runs is 7.43494 ms (host walltime is 7.52745 ms, 99% percentile time is 7.5119).

$ sudo ./trtexec --deploy=deploy_gn.prototxt --output="prob" --fp16 --allowGPUFallback
deploy: deploy_gn.prototxt
output: prob
fp16
allowGPUFallback
Input "data": 3x224x224
Output "prob": 1000x1x1
name=data, bindingIndex=0, buffers.size()=2
name=prob, bindingIndex=1, buffers.size()=2
Average over 10 runs is 2.33557 ms (host walltime is 2.42281 ms, 99% percentile time is 2.53942).
Average over 10 runs is 2.30036 ms (host walltime is 2.38397 ms, 99% percentile time is 2.30755).
Average over 10 runs is 2.30158 ms (host walltime is 2.38179 ms, 99% percentile time is 2.30685).
Average over 10 runs is 2.30062 ms (host walltime is 2.37806 ms, 99% percentile time is 2.30912).
Average over 10 runs is 2.30475 ms (host walltime is 2.38379 ms, 99% percentile time is 2.3351).
Average over 10 runs is 2.30268 ms (host walltime is 2.38351 ms, 99% percentile time is 2.30605).
Average over 10 runs is 2.29729 ms (host walltime is 2.37373 ms, 99% percentile time is 2.30205).
Average over 10 runs is 2.30181 ms (host walltime is 2.38668 ms, 99% percentile time is 2.31046).
Average over 10 runs is 2.30885 ms (host walltime is 2.4033 ms, 99% percentile time is 2.36349).
Average over 10 runs is 2.30052 ms (host walltime is 2.39568 ms, 99% percentile time is 2.30486).
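
The "3 times slower" figure can be checked directly from the log text. The Python sketch below parses the "Average over N runs" lines with a regular expression and compares the means. It uses only the first three lines of the useDLA=1 and GPU-only logs above as an excerpt, and the names `AVG_RE` and `mean_latency_ms` are made up for illustration.

```python
import re
from statistics import mean

# Captures the averaged latency from lines such as
# "Average over 10 runs is 7.61525 ms (host walltime is ...)."
AVG_RE = re.compile(r"Average over (\d+) runs is ([\d.]+) ms")


def mean_latency_ms(log_text):
    """Mean of all 'Average over N runs' values found in a trtexec log."""
    values = [float(m.group(2)) for m in AVG_RE.finditer(log_text)]
    if not values:
        raise ValueError("no 'Average over N runs' lines found")
    return mean(values)


dla_log = """\
Average over 10 runs is 7.61525 ms (host walltime is 7.71438 ms, 99% percentile time is 8.42531).
Average over 10 runs is 7.45112 ms (host walltime is 7.5426 ms, 99% percentile time is 7.54054).
Average over 10 runs is 7.47984 ms (host walltime is 7.56824 ms, 99% percentile time is 7.56355).
"""
gpu_log = """\
Average over 10 runs is 2.33557 ms (host walltime is 2.42281 ms, 99% percentile time is 2.53942).
Average over 10 runs is 2.30036 ms (host walltime is 2.38397 ms, 99% percentile time is 2.30755).
Average over 10 runs is 2.30158 ms (host walltime is 2.38179 ms, 99% percentile time is 2.30685).
"""

ratio = mean_latency_ms(dla_log) / mean_latency_ms(gpu_log)
print(f"DLA run takes {ratio:.2f}x as long as the GPU-only run")
```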

Hi Austriker, DLA is expected to be slower than the GPU; however, DLA is more energy efficient.

Hi Dusty,

Since it’s called the Deep Learning Accelerator, I thought it would be faster.
Regarding FCN-Alexnet: is semantic segmentation supported by the DLA?

The initial official DLA support in the JetPack EA release is for Alexnet (not FCN), Googlenet, ResNet50, and VGG.