Why can't a DLA 1x1 convolution get double throughput?

Here is the inference benchmark data I collected on the Xavier DLA with the trtexec tool.
I used a prototxt with only one convolution layer and varied the kernel size between 1x1 and 3x3.
I found that the 1x1 kernel cannot get double throughput (INT8 vs. FP16) the way the 3x3 kernel does. Why?

in-channel  in-H  in-W  out-channel  out-H  out-W   1x1 int8 [ms]  1x1 fp16 [ms]   3x3 int8 [ms]  3x3 fp16 [ms]
528         14    14    256          14     14      2.08769        2.11043         4.61778        10.1543
528         14    14    256          14     14      2.07632        2.10025         4.60823        10.1317
528         14    14    256          14     14      2.07486        2.1022          4.43135        10.1381
528         14    14    256          14     14      2.07529        2.09578         4.40238        10.1287
528         14    14    256          14     14      2.07222        2.09045         4.38512        10.1321
528         14    14    256          14     14      2.07541        2.09187         4.38848        10.1331
528         14    14    256          14     14      2.0712         2.09242         4.38632        10.1239
528         14    14    256          14     14      2.07254        2.0955          4.38462        10.1363
528         14    14    256          14     14      2.07126        2.09185         4.38445        10.1374
528         14    14    256          14     14      2.07068        2.09136         4.38677        10.2717
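
To make the gap obvious, here is the FP16/INT8 ratio from the first row (just a quick calculation on the numbers above):

```python
# FP16-to-INT8 time ratio, taken from the first row of the table above.
int8_1x1, fp16_1x1 = 2.08769, 2.11043
int8_3x3, fp16_3x3 = 4.61778, 10.1543

print(f"1x1: fp16/int8 = {fp16_1x1 / int8_1x1:.2f}x")  # ~1.01x, i.e. no INT8 speedup
print(f"3x3: fp16/int8 = {fp16_3x3 / int8_3x3:.2f}x")  # ~2.20x, roughly the expected doubling
```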

Hi,

Not all combinations of convolution parameters are supported by the DLA.
Would you mind checking first whether your use case is in the supported list in this document:
Developer Guide :: NVIDIA Deep Learning TensorRT Documentation

If a layer cannot be inferenced on the DLA, it will automatically fall back to the GPU,
and the DLA acceleration will not be available for it.
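
If it helps, here is a minimal Python sketch (an illustration only, not taken from your trtexec workflow) that builds a single 1x1 convolution with the TensorRT network API and asks the builder config whether each layer can run on the DLA; with GPU_FALLBACK set, any unsupported layer would be placed on the GPU instead:

```python
# Sketch: check DLA eligibility of a single 1x1 convolution programmatically.
# The layer construction below only mirrors the 1x1 prototxt; names are illustrative.
import numpy as np
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

inp = network.add_input("input", trt.float32, (1, 528, 14, 14))
weights = np.random.uniform(0.0, 1.0, (256, 528, 1, 1)).astype(np.float32)
conv = network.add_convolution_nd(inp, 256, (1, 1), trt.Weights(weights))
conv.name = "conv1"
network.mark_output(conv.get_output(0))

config = builder.create_builder_config()
config.default_device_type = trt.DeviceType.DLA
config.DLA_core = 0
config.set_flag(trt.BuilderFlag.FP16)           # DLA needs FP16 or INT8
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)   # unsupported layers fall back to the GPU

for i in range(network.num_layers):
    layer = network.get_layer(i)
    where = "DLA" if config.can_run_on_DLA(layer) else "GPU (fallback)"
    print(f"{layer.name}: {where}")
```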

Thanks.

Sure, I have checked; all these cases run on the DLA. Thank you anyway.

Hi,

To give a further suggestion, we would like to pass this issue to our internal team.

Would you mind sharing a sample/model of your use case,
so we can reproduce exactly the same issue in our environment?

Thanks

Problem: DLA convolution with INT8 cannot get double throughput over FP16 when the kernel size is 1x1, but it does double when the kernel size is 3x3.
Use cases: two prototxts with kernel size 1x1 or 3x3; trtexec is used to test the throughput.
The commands are like this:
1x1:
./trtexec --deploy=kernel_1x1.prototxt --output=conv1 --useDLACore=0 --batch=32 --int8 …
./trtexec --deploy=kernel_1x1.prototxt --output=conv1 --useDLACore=0 --batch=32 --fp16 …
3x3:
./trtexec --deploy=kernel_3x3.prototxt --output=conv1 --useDLACore=0 --batch=32 --int8 …
./trtexec --deploy=kernel_3x3.prototxt --output=conv1 --useDLACore=0 --batch=32 --fp16 …

-------kernel_1x1.prototxt-------------------------
layer {
  name: "input"
  type: "Input"
  top: "input"
  input_param {
    shape {
      dim: 1
      dim: 528
      dim: 14
      dim: 14
    }
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "input"
  top: "conv1"
  convolution_param {
    num_output: 256
    bias_term: false
    pad: 0
    kernel_size: 1
    weight_filler {
      type: "uniform"
      min: 0.0
      max: 1.0
    }
  }
}

The second one uses a 3x3 kernel, like this:

----------kernel_3x3.prototxt-----------
layer {
  name: "input"
  type: "Input"
  top: "input"
  input_param {
    shape {
      dim: 1
      dim: 528
      dim: 14
      dim: 14
    }
  }
}
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "input"
  top: "conv1"
  convolution_param {
    num_output: 256
    bias_term: false
    pad: 1
    kernel_size: 3
    weight_filler {
      type: "uniform"
      min: 0.0
      max: 1.0
    }
  }
}

Hi,

Thanks for your update.

We are trying to reproduce this issue in our environment
and will update you with more information later.

Hi,

For small layers or networks, the speedup is not that evident.

Also, there is some overhead from TensorRT, such as reformat layers.
In TensorRT 5.0, there are reformat layers at the beginning and end of the TensorRT engine generated for the DLA.
These reformat layers run on the GPU.

We are trying to break down the run-time implementation and check it further.
We will update you with more information later.

Thanks.

Thank you for the answer.

Sure, the layer's input data size is small, so the speedup cannot be doubled because of the other overhead.

But why can the convolution with the 3x3 filter get double performance? It has the same input data size, doesn't it?

Is it because the 1x1 filter needs less GEMM computation than the 3x3 filter? (A rough MAC count is sketched below.)
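
Putting rough numbers on that guess (a back-of-the-envelope sketch that only counts multiply-accumulates for the single convolution, ignoring any copy/reformat cost):

```python
# Per-image MAC count for the two test layers (estimate, not a measurement).
in_c, out_c, out_h, out_w = 528, 256, 14, 14

macs_1x1 = in_c * out_c * out_h * out_w * 1 * 1   # ~26.5 M MACs
macs_3x3 = in_c * out_c * out_h * out_w * 3 * 3   # ~238.4 M MACs, 9x more work

print(f"1x1: {macs_1x1 / 1e6:.1f} M MACs   3x3: {macs_3x3 / 1e6:.1f} M MACs")
```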

Hi,

This is because most of the time is occupied by the reformat overhead.
Please see the breakdown profiling results below:

INT8
input to nvm                             1.016ms
{conv1}                                  0.055ms
input copy finish                        0.011ms
conv1 from nvm                           1.070ms
conv1 copy finish                        0.005ms
Time over all layers:                    2.158ms

FP16
input to nvm                             0.970ms
{conv1}                                  0.052ms
input copy finish                        0.013ms
conv1 from nvm                           1.126ms
conv1 copy finish                        0.005ms
Time over all layers:                    2.166ms

Thanks.