After using DLA, the speed is slower

Hi,

We tried DLA acceleration, but when we run the sample, inference is slower. Is this normal, or are we doing something wrong?

Here are the steps:
1. Go into the samples folder of TensorRT and then into the sampleINT8 folder.
2. Compile the code without any modifications to the code or model.
3. Execute the corresponding executable with the correct parameters.
4. Get the result.

Our hardware and software:
AGX Xavier
TensorRT 5.0.3

(1) Running sample_int8 with DLA gives the following output:
nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./sample_int8 mnist useDLACore=1

DLA requested. Disabling for FP32 run since its not supported.

FP32 run:400 batches of size 30 starting at 100

Top1: 0.989833, Top5: 1
Processing 12000 images averaged 0.0176231 ms/image and 0.528693 ms/batch.

FP16 run:400 batches of size 30 starting at 100
WARNING: Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.

Top1: 0.92925, Top5: 0.9675
Processing 12000 images averaged 0.193493 ms/image and 5.80478 ms/batch.

DLA requested. Disabling for Int8 run since its not supported.

INT8 run:400 batches of size 30 starting at 100

Top1: 0.990167, Top5: 1
Processing 12000 images averaged 0.0362652 ms/image and 1.08796 ms/batch.


(2) Running sample_int8 without DLA gives the following output:
nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./sample_int8 mnist

FP32 run:400 batches of size 30 starting at 100

Top1: 0.989833, Top5: 1
Processing 12000 images averaged 0.0176359 ms/image and 0.529076 ms/batch.

FP16 run:400 batches of size 30 starting at 100

Top1: 0.98975, Top5: 1
Processing 12000 images averaged 0.0169346 ms/image and 0.508038 ms/batch.

INT8 run:400 batches of size 30 starting at 100

Top1: 0.990167, Top5: 1
Processing 12000 images averaged 0.0144124 ms/image and 0.432372 ms/batch.

From the output above, the FP16 run with DLA is slower; in fact it is more than 10 times slower than the FP16 run without DLA.

Is this normal, or do we need to change something?

Thanks

Dear lyzeng,

Could you please refer to the link below for your topic?

•Max batch size supported is 32.

Also refer to the link below for TensorRT and DLA on DRIVE AGX.
Object Detection and Lane Segmentation Using Multiple Accelerators with DRIVE AGX
https://devblogs.nvidia.com/drive-agx-accelerators-object-detection/

Thanks for your reply.

"Developer Guide :: NVIDIA Deep Learning TensorRT Documentation
•Max batch size supported is 32. "

We already set the batch size to 30, which is within the supported range.

Also, we tried setting the batch size to other values, but the results are still poor.
Another problem: when we change the batch size, the prediction accuracy of the model varies greatly, and sometimes it is very low.

Here are the results:
Batch size 2:
nvidia@tegra-ubuntu:/usr/src/tensorrt/bin$ ./sample_int8 mnist useDLACore=0

DLA requested. Disabling for FP32 run since its not supported.

FP32 run:1000 batches of size 2 starting at 500 (we set the batch size to 2)

Top1: 0.984, Top5: 1
Processing 2000 images averaged 0.156109 ms/image and 0.312217 ms/batch.

FP16 run:1000 batches of size 2 starting at 500
WARNING: Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.


Top1: 0.0795, Top5: 0.487 (the prediction accuracy is very low)
Processing 2000 images averaged 0.528846 ms/image and 1.05769 ms/batch.(the speed is slower)

DLA requested. Disabling for Int8 run since its not supported.

INT8 run:1000 batches of size 2 starting at 500


Top1: 0.984, Top5: 1
Processing 2000 images averaged 0.15653 ms/image and 0.31306 ms/batch.


Why does the prediction accuracy vary so much with the batch size?

Thanks

Hi,

Top1: 0.0795, Top5: 0.487 (***the prediction accuracy is very low***)

This makes no sense to me; I suspect something is incorrect.
May I know the model you use?
It looks like your model is not fully supported by the DLA.

WARNING: Default DLA is enabled but layer prob is not running on DLA, falling back to GPU.

Once a layer falls back to the GPU, performance may be limited by the data transfer between the DLA and the GPU rather than by compute.
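One way to see in advance which layers will fall back is to query the builder before building the engine. Below is a minimal sketch using the TensorRT 5 C++ API (setDefaultDeviceType, setDLACore, allowGPUFallback, and canRunOnDLA are from the TRT 5 IBuilder interface; this is an untested illustration, not the sample's actual code, and it assumes `builder` and `network` were created as in the TensorRT samples):

```cpp
#include <cstdio>
#include "NvInfer.h"

// Sketch: configure DLA on the builder, then report which layers of the
// parsed network the DLA cannot run (these will fall back to the GPU).
void reportDlaSupport(nvinfer1::IBuilder* builder,
                      nvinfer1::INetworkDefinition* network)
{
    builder->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
    builder->setDLACore(0);          // which DLA engine to use (0 or 1 on Xavier)
    builder->allowGPUFallback(true); // let unsupported layers run on the GPU

    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network->getLayer(i);
        if (!builder->canRunOnDLA(layer))
            std::printf("Layer %s will fall back to the GPU\n", layer->getName());
    }
}
```

If this prints any layers (as the `prob` warning in your log suggests), every inference pass pays for moving intermediate tensors between the DLA and the GPU, which can easily dominate the runtime of a tiny network like MNIST.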

Thanks.