Unexpected performance loss when using GPU, DLA0, DLA1 simultaneously

Hi,
I’m trying to run three networks concurrently: one on DLA0, one on DLA1, and the third on the GPU.
To do that, I used TensorRT_sample.zip and ran these three commands: ‘./test -1’, ‘./test -1 0’, and ‘./test -1 0 1’.
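For context, the sample essentially starts one worker thread per engine, each with its own runtime, execution context, and CUDA stream. Below is my own minimal sketch of that pattern, not the actual sample code; it assumes TensorRT 7-style APIs and pre-built gpu.engine / dla.engine files, and elides buffer allocation and FPS accounting:

#include <NvInfer.h>
#include <cuda_runtime.h>
#include <fstream>
#include <iterator>
#include <thread>
#include <vector>

struct Logger : nvinfer1::ILogger {
    void log(Severity, const char*) noexcept override {}
} gLogger;

// One worker per engine; dlaCore == -1 means "run on the GPU".
void worker(const char* path, int dlaCore) {
    std::ifstream f(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)),
                           std::istreambuf_iterator<char>());
    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    if (dlaCore >= 0) runtime->setDLACore(dlaCore);
    nvinfer1::ICudaEngine* engine =
        runtime->deserializeCudaEngine(blob.data(), blob.size());
    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    std::vector<void*> bindings(engine->getNbBindings());
    // ... allocate a device buffer for each binding ...

    for (;;) {
        ctx->enqueueV2(bindings.data(), stream, nullptr);  // async inference
        cudaStreamSynchronize(stream);
        // ... count iterations and print FPS once per second ...
    }
}

int main() {
    std::thread gpu(worker, "gpu.engine", -1);
    std::thread dla0(worker, "dla.engine", 0);
    std::thread dla1(worker, "dla.engine", 1);
    gpu.join(); dla0.join(); dla1.join();
}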

  • ‘./test -1’
###@###:~/Desktop/yjkim/trtexec_mt/TensorRT_sample$ ./test -1
Load engine from :gpu.engine
FPS:    195.982
FPS:    211.955
FPS:    212.943
FPS:    212.962
FPS:    212.963
FPS:    212.955
FPS:    211.939
FPS:    212.942
FPS:    212.963
FPS:    212.961
FPS:    213.962
FPS:    211.942
FPS:    212.955
FPS:    213.964
FPS:    211.965
FPS:    212.952
FPS:    213.952
FPS:    213.962
FPS:    210.941
  • ‘./test -1 0’
###@###:~/Desktop/yjkim/trtexec_mt/TensorRT_sample$ ./test -1 0
Load engine from :gpu.engine
Load engine from :dla.engine
FPS:    178.979 7.99939
FPS:    199.962 7.99857
FPS:    198.964 7.99863
FPS:    198.964 8.99838
FPS:    199.958 7.99829
FPS:    198.963 7.99852
FPS:    198.944 7.99775
FPS:    198.96  7.99837
FPS:    198.968 8.99858
FPS:    199.966 7.9983
FPS:    199.958 7.99864
FPS:    197.961 7.99844
FPS:    198.97  8.99829
FPS:    198.959 7.99862
FPS:    198.959 7.99837
FPS:    198.969 7.99876
FPS:    198.969 8.99817
FPS:    199.958 7.99855
FPS:    198.963 7.99863
FPS:    197.967 7.99869
FPS:    198.963 8.99799

The FPS of gpu.engine dropped from ~212 to ~198. I understand that this can happen because of some kind of resource contention.

However, when it comes to ‘./test -1 0 1’:

###@###:~/Desktop/yjkim/trtexec_mt/TensorRT_sample$ ./test -1 0 1
Load engine from :gpu.engine
Load engine from :dla.engine
Load engine from :dla.engine
FPS:    14.9949 6.99752 6.99888
FPS:    14.9948 7.99762 7.99672
FPS:    13.9954 6.99773 6.99853
FPS:    16.9964 8.99813 8.99814
FPS:    14.9967 7.99825 7.99825
FPS:    17.9962 8.9978  8.9978
FPS:    15.9965 7.99852 7.99852
FPS:    17.9951 8.99753 8.99753
FPS:    15.9955 7.99774 7.99774
FPS:    15.9954 7.99771 7.99771
FPS:    17.996  8.99801 8.998
FPS:    15.9957 7.99782 7.99782
FPS:    17.9961 8.99808 8.99808
FPS:    15.9962 7.99809 7.9978
FPS:    16.9965 8.99814 8.99848
FPS:    15.9967 7.998   7.998
FPS:    15.9965 8.99839 8.99837
FPS:    15.9957 7.99785 7.99757
FPS:    16.9963 8.99808 8.9984
FPS:    15.9964 7.99784 7.99784
FPS:    17.9963 8.99852 8.99853
FPS:    14.997  7.9984  7.99839

The FPS of gpu.engine dropped to 14.997, which I think is far too low.

  1. GPU ==> 210
  2. GPU+DLA0 ==> 198
  3. GPU+DLA0+DLA1 ==> 14??

Could you explain this unexpected performance loss in gpu.engine when running the three networks concurrently?

===

I took a closer look at this phenomenon using the Nsight visual profiler.
When running gpu.engine and dla0.engine at the same time,
[Figure: Nsight timeline of gpu.engine and dla0.engine running concurrently]
as you can see in this figure, those two threads run asynchronously very well.
Average inference time of gpu.engine = 12.XX ms
Average inference time of dla0.engine = 120.XX ms

However, when running gpu.engine, dla0.engine, and dla1.engine at the same time,
[Figure: Nsight timeline of gpu.engine, dla0.engine, and dla1.engine running concurrently]
it seems that some inferences of gpu.engine wait until inferences of dla0.engine have finished.
Or at least gpu.engine appears to be blocked by something, which might be causing the unexpected loss.

However, I don’t know how to fix this issue. Is this a problem with the sample code, or with the TensorRT API?
Is there any way to run all three threads asynchronously without any unexpected performance loss?

Any help will be greatly appreciated.

Thanks.

yjkim.

Hi,

Could you set the following environment variable and try again?

$ export CUDA_DEVICE_MAX_CONNECTIONS=32
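If it is more convenient to keep this inside the application, the variable can also be set programmatically, as long as that happens before the first CUDA call creates the context. A minimal sketch (my illustration, not part of the sample):

#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Must run before any CUDA API call: the connection count is
    // read once, when the CUDA context is created.
    setenv("CUDA_DEVICE_MAX_CONNECTIONS", "32", /*overwrite=*/1);

    cudaFree(0);  // forces context creation with the new setting

    // ... create the TensorRT runtimes, engines, and threads as before ...
    return 0;
}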

Thanks.

Thanks a lot!

yjkim.

Was this fixed with that command?
I’m planning a similar setup. Should I add that command to my .bashrc, and would it affect performance in other scenarios?
A little further explanation of what that command does would be great.

Hi,

You can find the explanation of CUDA_DEVICE_MAX_CONNECTIONS below:

https://docs.nvidia.com/deploy/mps/index.html#topic_5_2_4

In general, the parameter sets the number of concurrent hardware work queues (compute channels) between the host and the device.
The default value is 4. When more streams than that submit work concurrently, some streams share a queue and their submissions can serialize, which appears as blocking on the profiler timeline.
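To see the effect in isolation, here is a toy CUDA sketch (unrelated to the sample) that launches one long-running kernel on each of eight streams. Profiling it once with the default setting and once with CUDA_DEVICE_MAX_CONNECTIONS=32 shows how much more the streams overlap:

// stream_overlap.cu -- toy demo of the connection limit
#include <cuda_runtime.h>

// Busy-wait on the device for roughly `cycles` clock cycles.
__global__ void spin(long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    const int kStreams = 8;  // more streams than the default 4 channels
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i)
        cudaStreamCreate(&streams[i]);

    // With the default connection count, eight streams share four
    // hardware channels, so some of these launches serialize; with
    // CUDA_DEVICE_MAX_CONNECTIONS=32 each stream gets its own channel
    // and the kernels can run concurrently.
    for (int i = 0; i < kStreams; ++i)
        spin<<<1, 1, 0, streams[i]>>>(100000000LL);

    cudaDeviceSynchronize();
    for (int i = 0; i < kStreams; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}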

Thanks.

I don’t know exactly what that means, but I’m going to put the command in my .bashrc.
If nothing explodes, it will be fine.
Cheers!