Running a pure conv2d node on DLA makes the GPU slower

Hi guys,

I’m testing performance numbers on Orin in MAXN mode. The pure conv2d model gives the results below:
a. Run GPU only: 405.3 qps
b. Run DLA only: 132.7 qps
c. Run GPU + DLA x2: GPU 229.2 qps + DLA 113.1 qps * 2 = 455.4 qps in total

It seems that the GPU slows down when we use the GPU and DLA together. I checked some related issues in the Xavier topics, but I am not sure whether they have the same cause.

The simple structure of the network is shown in the following picture:

The commands are as follows:

/usr/src/tensorrt/bin/trtexec --onnx=conv2d.onnx --avgRuns=200 --int8 --iterations=10000 --useSpinWait
/usr/src/tensorrt/bin/trtexec --onnx=conv2d.onnx --avgRuns=200 --int8 --iterations=10000 --useSpinWait --useDLACore=0 (or 1) --allowGPUFallback (tried both with and without --allowGPUFallback; the results seem the same)

When only one DLA is running, the Jetson power GUI shows that the GPU loading is around 30%. So I used Nsight Systems to get a profile of this model. The following picture shows the run with one DLA only:
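For reference, the trace was captured roughly like this (a sketch from memory; the output file name is arbitrary):

$ nsys profile -o dla_only_run /usr/src/tensorrt/bin/trtexec --onnx=conv2d.onnx --avgRuns=200 --int8 --iterations=10000 --useSpinWait --useDLACore=0 --allowGPUFallback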


Question#1. There is a D2H and an H2D memcpy in each loop. Are these memcpys, plus the permutation blocks in #2, the reason the GPU is about 30% occupied when only the DLA is running? I assume there should be no fallback layer.

Question#2. There are two blue blocks labelled “void …” on stream 18; they are actually permutationKernelPLC3. Why is there such a kernel on the GPU while the network runs on the DLA? What does this kernel do? And why are there two permutation blocks back to back?

Question#3. What is “Task 6.934ms” in DLA0? Is it the wall time the DLA hardware uses? Why is there about 0.5 ms of idle time between Tasks? Is the DLA waiting for the GPU permutationKernelPLC3?

Then I compared the GPU-only profile with the GPU+2xDLA profile. For GPU only:

and GPU+2xDLA:

Question#4. In GPU+2xDLA, the time for the blue “trt_ampere_fp32_icudnn_int8x4…” kernel becomes unstable. It can stretch to 4.383 ms, as highlighted in the GPU+2xDLA profile, and then drop back to around 2.4 ms (while in GPU-only mode it is normally around 2.4 ms, as highlighted). This happens every 2 loops. So why does the GPU computation slow down? Is it affected by the permutationKernelPLC3 from the DLA process? That kernel seems very short (less than 0.6 ms; for 2xDLA I assumed it could be 1.2 ms, and still 1.2 + 2.4 < 4.383).

Question#5. On stream 18, which seems to be the compute stream in GPU+2xDLA, there is a significant idle time between every 2 loops (about 2.28 ms between every 2 blue trt_ampere blocks). Why does this idle time exist, while in GPU-only mode the idle time is very short (about 0.05 ms)?

Question#6. The DLA “Task” time is 6.934 ms in DLA-only mode, but in GPU+2xDLA it is prolonged to around 7.2 ms. If the DLA hardware runs independently once the data has been copied into DLA memory, why does this time become longer in GPU+2xDLA mode?
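If per-layer timing would help with the comparison, I can also dump it directly from trtexec, roughly like below (a sketch; the output file name is arbitrary):

$ /usr/src/tensorrt/bin/trtexec --onnx=conv2d.onnx --avgRuns=200 --int8 --useSpinWait --dumpProfile --exportProfile=gpu_only_profile.json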

Should you require any further information, please do not hesitate to let me know.
I look forward to your reply.
Thanks.

Hi,

Could you share the conv2d.onnx model with us so we can reproduce the same in our environment?
Please note that you can see the reformat layer information by running trtexec with the --verbose flag.
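For example, based on your earlier command:

$ /usr/src/tensorrt/bin/trtexec --onnx=conv2d.onnx --int8 --useDLACore=0 --allowGPUFallback --verbose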

Also, have you fixed the clocks to the maximum before profiling?

$ sudo jetson_clocks

Since Jetson uses dynamic frequency scaling by default, the performance may vary unless the clocks are fixed.
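You should be able to confirm the current clock settings with:

$ sudo jetson_clocks --show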

Thanks.

Hi AastaLLL,

Thanks for the quick reply.
I have just set the Jetson clocks to maximum as you suggested.
The result is slightly better:
GPU 234.8 qps + DLA 112.4 qps * 2 = 459.6 qps in total
But the GPU is still slower than in the GPU-only case.

Here I attach the model.
conv2d_1024_1024.onnx (9.8 KB)

After adding the verbose flag to a quick run, some information about reformatting shows up during the engine-building stage:

[07/06/2022-16:20:44] [V] [TRT] *************** Autotuning Reformat: Int8(3145728,1048576,1024,1) -> Int8(1048576,1048576:32,1024,1) ***************
[07/06/2022-16:20:44] [V] [TRT] --------------- Timing Runner: Optimizer Reformat(input to nvm -> <out>) (Reformat)
[07/06/2022-16:20:44] [V] [TRT] Setting a default quantization params because quantization data is missing for 
[07/06/2022-16:20:44] [V] [TRT] Tactic: 1000 Time: 1.16272
[07/06/2022-16:20:44] [V] [TRT] Setting a default quantization params because quantization data is missing for 
[07/06/2022-16:20:44] [V] [TRT] Tactic: 1002 Time: 0.929723
[07/06/2022-16:20:44] [V] [TRT] Setting a default quantization params because quantization data is missing for 
[07/06/2022-16:20:44] [V] [TRT] Tactic: 0 Time: 0.971511
[07/06/2022-16:20:44] [V] [TRT] Fastest Tactic: 1002 Time: 0.929723
[07/06/2022-16:20:44] [V] [TRT] *************** Autotuning Reformat: Int8(1048576,1:4,1024,1) -> Int8(3145728,1048576,1024,1) ***************
... ...

But I am not sure what it means. Could you help to explain it?
The log file is attached here:
DLA_only_fix_verbose.log (35.4 KB)

Hi,

Reformatting converts the input data from float32 into int8, and it runs on the GPU.

[07/06/2022-16:20:47] [V] [TRT] Engine Layer Information:
Layer(Reformat): input to nvm, Tactic: 1002, input[Float(1,3,1024,1024)] -> input copy[Int8(1,3,1024,1024)]
Layer(DLA): {ForeignNode[Conv_1]}, Tactic: 3, input copy[Int8(1,3,1024,1024)] -> output copy[Int8(1,10,1016,1016)]
Layer(Reformat): output from nvm, Tactic: 1002, output copy[Int8(1,10,1016,1016)] -> output[Float(1,10,1016,1016)]
Layer(FinishNvmRegion): input copy finish, Tactic: 0, input copy[Int8(1,3,1024,1024)] -> 
Layer(FinishNvmRegion): output copy finish, Tactic: 0, output copy[Int8(1,10,1016,1016)] -> 

We are checking your model internally.
Will share more information with you later.

Thanks.

Hi AastaLLL,

Oh, I didn’t pay attention to that part of the log. Thanks.
Looking forward to your result.

Hi,

If we set the input and output data formats to int8 with the dla_linear layout, there is no reformatting layer and GPU utilization becomes much lower.
Would you mind also giving it a try?

$ /usr/src/tensorrt/bin/trtexec --onnx=conv2d_1024_1024.onnx --useDLACore=0 --inputIOFormats=int8:dla_linear --outputIOFormats=int8:dla_linear --int8

Thanks.

Hi AastaLLL,

Just tried it and got a much better result for GPU + 2xDLA: 307.9 + 114.2 * 2 = 536.3 qps. Thanks.

#1 The GPU perf is still not 405 qps. Just to confirm, is it the memcpy in the DLA process that makes the GPU slower?

#2 I observe that when only 1xDLA is running, the GPU loading is 7%~10%, but for 2xDLA the GPU loading becomes 40%~50% in the Jetson GUI. How should we interpret this?
(screenshot from the Jetson power GUI)
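The same loading can also be watched from the command line with tegrastats, if that helps cross-check the GUI numbers:

$ sudo tegrastats --interval 1000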

#3 In the GPU + 2xDLA case, the GPU waits 4~5 ms every 8 loops. Do you know why?

Thanks.

Hi,

We also found some performance issues when running 2x DLA and GPU concurrently.

The root cause is related to memory bandwidth and GPU scheduling.
However, we are not able to disclose the detail here.

A fix for this issue has been implemented internally.
It will be included in a future release (not 5.0 GA).

Thanks.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.