Run pure conv2d node on DLA makes GPU get slower

IMTBretagne · July 5, 2022, 10:48am

Hi guys,

I’m measuring performance on Orin in MAXN mode. A pure conv2d model gives the following results:
a. GPU only: 405.3 qps
b. DLA only: 132.7 qps
c. GPU + 2x DLA: GPU 229.2 qps + DLA 113.1 qps * 2 = 455.4 qps in total
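
As a quick sanity check on these figures, here is a minimal sketch that recomputes the combined throughput and compares it with a simple additive estimate. The "ideal" total assumes the GPU and both DLA cores would keep their standalone throughput when run together; that is only a reference point for comparison, not a claim about how the hardware should behave.

```python
# Throughput figures (qps) as reported in the post.
gpu_only = 405.3
dla_only = 132.7
gpu_concurrent = 229.2   # GPU while two DLA cores are also running
dla_concurrent = 113.1   # each DLA core while the GPU is also running
num_dla = 2

# Measured aggregate throughput when everything runs together.
measured_total = gpu_concurrent + num_dla * dla_concurrent

# Assumption: a purely additive ideal with no interference between engines.
ideal_total = gpu_only + num_dla * dla_only

# Fraction of standalone throughput each engine keeps under concurrency.
gpu_retained = gpu_concurrent / gpu_only
dla_retained = dla_concurrent / dla_only

print(f"measured total: {measured_total:.1f} qps")
print(f"additive ideal: {ideal_total:.1f} qps")
print(f"GPU retains {gpu_retained:.1%}, each DLA retains {dla_retained:.1%}")
```

With these inputs the measured total is 455.4 qps against an additive 670.7 qps, and most of the gap comes from the GPU, which keeps a bit under 57% of its standalone rate.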

It seems that the GPU slows down when the GPU and DLA are used together. I checked some related issues under the Xavier topics, but I am not sure whether they have the same cause.

The network structure is simple, as shown in the following picture:

The commands are as follows:

/usr/src/tensorrt/bin/trtexec --onnx=conv2d.onnx --avgRuns=200 --int8 --iterations=10000 --useSpinWait
/usr/src/tensorrt/bin/trtexec --onnx=conv2d.onnx --avgRuns=200 --int8 --iterations=10000 --useSpinWait --useDLACore=0 (or 1) --allowGPUFallback (tried both with and without --allowGPUFallback; the results seem the same)

When only one DLA is running, the Jetson Power GUI shows that the GPU load is around 30%, so I used Nsight to profile this model. The following picture shows the profile with one DLA only:

Question#1. There is a D2H and an H2D copy in each loop. Are these memcpys, plus the permutation blocks in #2, the reason why the GPU is about 30% occupied when only the DLA is running? I assume there should be no fallback layer.

Question#2. There are two “void …” blocks in blue on stream 18, which are actually permutationKernelPLC3. Why is there such a block on the GPU while the network is running on DLA? What does this block do? Why are there two consecutive permutation blocks?

Question#3. What is “Task 6.934ms” in DLA0? Is it the wall time used by the DLA hardware? Why is there about 0.5ms of idle time between Tasks? Is the DLA waiting for the GPU permutationKernelPLC3?

Then I compared the GPU-only profile with the GPU+2xDLA profile. For GPU only:

and GPU+2xDLA:

Question#4. In GPU+2xDLA, the duration of the “trt_ampere_fp32_icudnn_int8x4…” kernel (in blue) becomes unstable. It can stretch to 4.383ms, as highlighted in the GPU+2xDLA profile, and then return to around 2.4ms (in GPU-only mode it is normally around 2.4ms, as highlighted). This happens every 2 loops. Why does the GPU computation slow down? Is it affected by the permutationKernelPLC3 from the DLA processes? That kernel seems very short, though (less than 0.6ms; for 2xDLA I assume it could be 1.2ms, but 1.2 + 2.4 is still less than 4.383).

Question#5. In GPU+2xDLA, stream 18 seems to be the computation stream. There is a significant idle time every 2 loops (about 2.28ms between every 2 blue trt_ampere blocks). Why does this idle time exist, while in GPU-only mode the idle time is very short (about 0.05ms)?

Question#6. The DLA “Task” time is 6.934ms in DLA-only mode, but in GPU+2xDLA it increases to around 7.2ms. If the DLA hardware runs independently once the data has been copied to DLA-related memory, why does this time become longer in GPU+2xDLA mode?

Should you require any further information, please do not hesitate to let me know.
I look forward to your reply.
Thanks.

AastaLLL · July 6, 2022, 5:17am

Hi,

Could you share the conv2d.onnx model with us so we can reproduce the same in our environment?
Please note that you can get some replacement information by running trtexec with the --verbose flag.

Also, have you fixed the clocks to the maximum before profiling?

$ sudo jetson_clocks

Since Jetson by default uses dynamic frequency, the performance may vary under this setting.

Thanks.

IMTBretagne · July 6, 2022, 8:30am

Hi AastaLLL,

Thanks for the quick reply.
I have just set the Jetson clocks to maximum as you suggested.
The result is slightly better:
GPU 234.8 qps + DLA 112.4 qps * 2 = 459.6 qps
But the GPU is still slower than in the GPU-only case.

Here I attach the model.
conv2d_1024_1024.onnx (9.8 KB)

After adding the verbose flag in a quick run, some information about reformatting shows up during the model building stage:

[07/06/2022-16:20:44] [V] [TRT] *************** Autotuning Reformat: Int8(3145728,1048576,1024,1) -> Int8(1048576,1048576:32,1024,1) ***************
[07/06/2022-16:20:44] [V] [TRT] --------------- Timing Runner: Optimizer Reformat(input to nvm -> <out>) (Reformat)
[07/06/2022-16:20:44] [V] [TRT] Setting a default quantization params because quantization data is missing for 
[07/06/2022-16:20:44] [V] [TRT] Tactic: 1000 Time: 1.16272
[07/06/2022-16:20:44] [V] [TRT] Setting a default quantization params because quantization data is missing for 
[07/06/2022-16:20:44] [V] [TRT] Tactic: 1002 Time: 0.929723
[07/06/2022-16:20:44] [V] [TRT] Setting a default quantization params because quantization data is missing for 
[07/06/2022-16:20:44] [V] [TRT] Tactic: 0 Time: 0.971511
[07/06/2022-16:20:44] [V] [TRT] Fastest Tactic: 1002 Time: 0.929723
[07/06/2022-16:20:44] [V] [TRT] *************** Autotuning Reformat: Int8(1048576,1:4,1024,1) -> Int8(3145728,1048576,1024,1) ***************
... ...

But I am not sure what it means. Could you help explain it?
The log file is attached here:
DLA_only_fix_verbose.log (35.4 KB)
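
One way to read the `Int8(...)` tuples in the reformat lines above: the numbers look like per-dimension strides, counted in elements. The sketch below assumes an input of shape (1, 3, 1024, 1024) and reproduces the first tuple as plain contiguous NCHW strides. Reading the `:32` suffix as "channels packed into vectors of 32" is my own interpretation, not something the log states, and the helper names are made up for illustration.

```python
import math

def contiguous_strides(shape):
    """Element strides of a densely packed, row-major tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

# Assumed input shape (N, C, H, W).
shape = (1, 3, 1024, 1024)

# Plain linear layout: matches Int8(3145728,1048576,1024,1) in the log.
linear = contiguous_strides(shape)

# Assumption: ":32" means channels are grouped into vectors of 32, so the
# channel dimension shrinks to ceil(C / 32) groups and strides count vectors.
vector_width = 32
n, c, h, w = shape
packed_shape = (n, math.ceil(c / vector_width), h, w)
packed = contiguous_strides(packed_shape)

print("linear strides:", linear)
print("packed strides (in 32-wide vectors):", packed)
```

Under that reading, the "Autotuning Reformat" line describes the same int8 tensor being moved from a linear layout to a channel-vectorized one, and TensorRT is timing candidate kernels (tactics) for that conversion.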

AastaLLL · July 8, 2022, 5:10am

Hi,

Reformatting converts the input data from float32 to int8 (and the output back to float32), and it runs on the GPU.

[07/06/2022-16:20:47] [V] [TRT] Engine Layer Information:
Layer(Reformat): input to nvm, Tactic: 1002, input[Float(1,3,1024,1024)] -> input copy[Int8(1,3,1024,1024)]
Layer(DLA): {ForeignNode[Conv_1]}, Tactic: 3, input copy[Int8(1,3,1024,1024)] -> output copy[Int8(1,10,1016,1016)]
Layer(Reformat): output from nvm, Tactic: 1002, output copy[Int8(1,10,1016,1016)] -> output[Float(1,10,1016,1016)]
Layer(FinishNvmRegion): input copy finish, Tactic: 0, input copy[Int8(1,3,1024,1024)] -> 
Layer(FinishNvmRegion): output copy finish, Tactic: 0, output copy[Int8(1,10,1016,1016)] ->
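
To get a feel for how much work the two Reformat layers in this listing represent, here is a rough estimate of the data they touch per inference. It uses only the shapes and types from the layer information above, and assumes each reformat reads its input once and writes its output once with densely packed elements (4 bytes for float32, 1 byte for int8). Real kernels may pad or use vectorized layouts, so treat the result as an order-of-magnitude figure.

```python
from math import prod

# Shapes taken from the engine layer information.
input_shape = (1, 3, 1024, 1024)
output_shape = (1, 10, 1016, 1016)

FLOAT32_BYTES = 4
INT8_BYTES = 1

def reformat_traffic(shape, src_bytes, dst_bytes):
    """Bytes read plus bytes written when converting one tensor.

    Assumption: one full read and one full write, no padding.
    """
    elements = prod(shape)
    return elements * (src_bytes + dst_bytes)

# Input side: float32 -> int8. Output side: int8 -> float32.
input_traffic = reformat_traffic(input_shape, FLOAT32_BYTES, INT8_BYTES)
output_traffic = reformat_traffic(output_shape, INT8_BYTES, FLOAT32_BYTES)
total_bytes = input_traffic + output_traffic

print(f"input reformat:  {input_traffic / 1e6:.1f} MB per inference")
print(f"output reformat: {output_traffic / 1e6:.1f} MB per inference")
print(f"total:           {total_bytes / 1e6:.1f} MB per inference")
```

Under these assumptions the reformat layers move about 67 MB per inference, most of it on the output side, which gives a sense of why they are visible as GPU load even though the convolution itself runs on DLA.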

We are checking your model internally.
Will share more information with you later.

Thanks.

IMTBretagne · July 8, 2022, 5:32am

Hi AastaLLL,

Oh, I hadn’t paid attention to this part of the log. Thanks.
Looking forward to your results.

AastaLLL · July 11, 2022, 5:53am

Hi,

If we set the input and output data to int8 and dla_linear, there is no reformatting layer and GPU utilization becomes much lower.
Would you mind also giving it a try?

$ /usr/src/tensorrt/bin/trtexec --onnx=conv2d_1024_1024.onnx --useDLACore=0 --inputIOFormats=int8:dla_linear --outputIOFormats=int8:dla_linear --int8

Thanks.

IMTBretagne · July 11, 2022, 8:09am

Hi AastaLLL,

I just tried it and got a much better result for GPU + 2xDLA: 307.9 + 114.2 * 2 = 536.3 qps. Thanks.
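
For reference, here is a small sketch that puts the three GPU + 2xDLA measurements from this thread side by side, relative to the 405.3 qps GPU-only baseline. The run labels are my own shorthand for the posts they come from, and the `summarize` helper is hypothetical.

```python
GPU_ONLY_QPS = 405.3  # GPU-only baseline from the first post
NUM_DLA = 2

# (GPU qps, per-DLA qps) for each GPU + 2xDLA run reported in the thread.
runs = {
    "first run": (229.2, 113.1),
    "after jetson_clocks": (234.8, 112.4),
    "int8:dla_linear I/O": (307.9, 114.2),
}

def summarize(gpu_qps, dla_qps):
    """Return total throughput and the share of baseline GPU speed kept."""
    total = gpu_qps + NUM_DLA * dla_qps
    gpu_retained = gpu_qps / GPU_ONLY_QPS
    return total, gpu_retained

summary = {name: summarize(*qps) for name, qps in runs.items()}

for name, (total, gpu_retained) in summary.items():
    print(f"{name:22s} total {total:6.1f} qps, GPU at {gpu_retained:.1%} of baseline")
```

Removing the reformat layers lifts the GPU from roughly 57% to about 76% of its standalone throughput, while the per-DLA figure barely changes.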

#1 The GPU performance is still below 405 qps. Just to confirm: is it the memcpy in the DLA process that makes the GPU slower?

#2 I observe that when running only 1xDLA, the GPU load is 7%~10%, but for 2xDLA the GPU load becomes 40%~50% in the Jetson GUI. How should we interpret this?
jetsonGUI_cut

#3 In the GPU + 2xDLA case, the GPU waits 4~5ms every 8 loops. Do you know why?

Thanks.

AastaLLL · July 12, 2022, 6:43am

Hi,

We also found some performance issues when running 2x DLA and the GPU concurrently.

The root cause is related to memory bandwidth and GPU scheduling.
However, we are not able to disclose the details here.

A fix for this issue is available internally.
It will be included in a future release (not 5.0 GA).

Thanks.

system · August 3, 2022, 2:30am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
