When DLA is enabled on NX, inference is slower than on the GPU

I constructed a simple network as below:

0: (Unnamed Layer* 0) [Convolution], 0
1: (Unnamed Layer* 1) [Convolution], 0

which are the first two convolution layers of MobileNetV2.

I set the input size to 384×768 and the batch size to 1, and enabled INT8 mode.

The measured time cost is:

3.750966ms on DLA
1.342228ms on GPU

I am very confused by this result. Could you please tell me why this happens?
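For reference, a setup like the one described above can be benchmarked with TensorRT's `trtexec` tool; the sketch below is a hypothetical reproduction (the model file name is a placeholder, and the exact timing output depends on the device, JetPack version, and clock settings):

```shell
# Hypothetical repro: benchmark the same two-conv INT8 network on DLA vs. GPU.
# "model.onnx" is a placeholder for the exported network with a 384x768 input.

# Run on DLA core 0, falling back to the GPU for unsupported layers:
trtexec --onnx=model.onnx --int8 --useDLACore=0 --allowGPUFallback

# Run on the GPU for comparison:
trtexec --onnx=model.onnx --int8
```

Comparing the reported mean latency of the two runs gives the DLA-vs-GPU numbers quoted above.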



The NVIDIA DLA is designed specifically for deep learning use cases and is used to offload the GPU's inference effort.
These engines improve energy efficiency and free up the GPU to run more complex networks or dynamic tasks implemented by the user.

DLA targets energy efficiency rather than raw performance, so a higher latency than the GPU is expected for small networks like this one.
You can find some performance data in our DeepStream documentation to compare the results:
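For completeness, here is a minimal sketch of how a TensorRT builder configuration selects DLA as the default device with GPU fallback; this is an illustrative fragment only (it assumes the TensorRT Python API on a Jetson device and omits network construction and calibration):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Build in INT8 precision, as in the experiment above.
config.set_flag(trt.BuilderFlag.INT8)

# Place layers on DLA core 0 by default, and allow layers
# the DLA cannot run to fall back to the GPU.
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
```

Without `GPU_FALLBACK`, engine building fails if any layer in the network is not supported on the DLA.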