Profiling a network running on the DLA

Hi,
I’m trying to understand what takes most of the time in a network running on the DLA.
Below is trtexec’s dump for this network running on the DLA (with some layers falling back to the GPU):

More specifically, what are those “output reformatter XX” lines? Are they reformatting the input/output, or are they the actual layers running on the DLA (and as such cannot be optimized)?

77.3% + 17.1% → this accounts for most of the time.

[02/01/2021-18:55:10] [I] ResNet18v2/conv0/Conv2D__27 output reformatter 0 9.14 0.12 0.3
[02/01/2021-18:55:10] [I] {ResNet18v2/conv0/Conv2D,ResNet18v2/Relu,ResNet18v2/stage1_maxpool,ResNet18v2/group0/block0/conv1/Conv2D,ResNet18v2/Relu_1,ResNet18v2/group0/block0/conv2/Conv2D,ResNet18v2/stage2_add1,ResNet18v2/stage2_relu1,ResNet18v2/group0/block1/conv1/Conv2D,ResNet18v2/Relu_2,ResNet18v2/group0/block1/conv2/Conv2D,ResNet18v2/stage2_add2,ResNet18v2/stage2_relu2,ResNet18v2/group1/block0/convshortcut/Conv2D,ResNet18v2/group1/block0/conv1/Conv2D,ResNet18v2/Relu_3,ResNet18v2/group1/block0/conv2/Conv2D,ResNet18v2/stage3_add1,ResNet18v2/stage3_relu1,ResNet18v2/group1/block1/conv1/Conv2D,ResNet18v2/Relu_4,ResNet18v2/group1/block1/conv2/Conv2D,ResNet18v2/stage3_add2,ResNet18v2/stage3_relu2,ResNet18v2/group2/block0/convshortcut/Conv2D,ResNet18v2/group2/block0/conv1/Conv2D,ResNet18v2/Relu_5,ResNet18v2/group2/block0/conv2/Conv2D,ResNet18v2/stage4_add1,ResNet18v2/stage4_relu1,ResNet18v2/group2/block1/conv1/Conv2D,ResNet18v2/Relu_6,ResNet18v2/group2/block1/conv2/Conv2D,ResNet18v2/stage4_add2,ResNet18v2/stage4_relu2,ResNet18v2/group3/block0/convshortcut/Conv2D,ResNet18v2/group3/block0/conv1/Conv2D,ResNet18v2/Relu_7,ResNet18v2/group3/block0/conv2/Conv2D,ResNet18v2/stage5_add1,ResNet18v2/stage5_relu1,ResNet18v2/group3/block1/conv1/Conv2D,ResNet18v2/Relu_8,ResNet18v2/group3/block1/conv2/Conv2D,ResNet18v2/stage5_add2,ResNet18v2/stage5_relu2,ResNet18v2/conv_extras1_1x1_s1/Conv2D,ResNet18v2/Relu_9,ResNet18v2/conv_extras2_3x3_s1/Conv2D,ResNet18v2/Relu_10,ResNet18v2/conv_extras3_1x1_s1/Conv2D,ResNet18v2/Relu_11,ResNet18v2/conv_extras4_3x3_s2/Conv2D,ResNet18v2/Relu_12,ResNet18v2/conv_ff3_1x1_s1/Conv2D,ResNet18v2/Relu_15,ResNet18v2/conv_ff2_1x1_s1/Conv2D,ResNet18v2/conv_ff2_1x1_s1_bn/FusedBatchNormV3,ResNet18v2/Relu_14,ResNet18v2/conv_ff1_1x1_s1/Conv2D,ResNet18v2/conv_ff1_1x1_s1_bn/FusedBatchNormV3,ResNet18v2/Relu_13} 50.41 0.64 1.6
[02/01/2021-18:55:10] [I] ResNet18v2/conv0/Conv2D__27:0 finish 4.90 0.06 0.2
[02/01/2021-18:55:10] [I] {ResNet18v2/conv0/Conv2D,ResNet18v2/Relu,ResNet18v2/stage1_maxpool,ResNet18v2/group0/block0/conv1/Conv2D,ResNet18v2/Relu_1,ResNet18v2/group0/block0/conv2/Conv2D,ResNet18v2/stage2_add1,ResNet18v2/stage2_relu1,ResNet18v2/group0/block1/conv1/Conv2D,ResNet18v2/Relu_2,ResNet18v2/group0/block1/conv2/Conv2D,ResNet18v2/stage2_add2,ResNet18v2/stage2_relu2,ResNet18v2/group1/block0/convshortcut/Conv2D,ResNet18v2/group1/block0/conv1/Conv2D,ResNet18v2/Relu_3,ResNet18v2/group1/block0/conv2/Conv2D,ResNet18v2/stage3_add1,ResNet18v2/stage3_relu1,ResNet18v2/group1/block1/conv1/Conv2D,ResNet18v2/Relu_4,ResNet18v2/group1/block1/conv2/Conv2D,ResNet18v2/stage3_add2,ResNet18v2/stage3_relu2,ResNet18v2/group2/block0/convshortcut/Conv2D,ResNet18v2/group2/block0/conv1/Conv2D,ResNet18v2/Relu_5,ResNet18v2/group2/block0/conv2/Conv2D,ResNet18v2/stage4_add1,ResNet18v2/stage4_relu1,ResNet18v2/group2/block1/conv1/Conv2D,ResNet18v2/Relu_6,ResNet18v2/group2/block1/conv2/Conv2D,ResNet18v2/stage4_add2,ResNet18v2/stage4_relu2,ResNet18v2/group3/block0/convshortcut/Conv2D,ResNet18v2/group3/block0/conv1/Conv2D,ResNet18v2/Relu_7,ResNet18v2/group3/block0/conv2/Conv2D,ResNet18v2/stage5_add1,ResNet18v2/stage5_relu1,ResNet18v2/group3/block1/conv1/Conv2D,ResNet18v2/Relu_8,ResNet18v2/group3/block1/conv2/Conv2D,ResNet18v2/stage5_add2,ResNet18v2/stage5_relu2,ResNet18v2/conv_extras1_1x1_s1/Conv2D,ResNet18v2/Relu_9,ResNet18v2/conv_extras2_3x3_s1/Conv2D,ResNet18v2/Relu_10,ResNet18v2/conv_extras3_1x1_s1/Conv2D,ResNet18v2/Relu_11,ResNet18v2/conv_extras4_3x3_s2/Conv2D,ResNet18v2/Relu_12,ResNet18v2/conv_ff3_1x1_s1/Conv2D,ResNet18v2/Relu_15,ResNet18v2/conv_ff2_1x1_s1/Conv2D,ResNet18v2/conv_ff2_1x1_s1_bn/FusedBatchNormV3,ResNet18v2/Relu_14,ResNet18v2/conv_ff1_1x1_s1/Conv2D,ResNet18v2/conv_ff1_1x1_s1_bn/FusedBatchNormV3,ResNet18v2/Relu_13} output reformatter 1 2504.86 31.71 77.3
[02/01/2021-18:55:10] [I] {ResNet18v2/conv0/Conv2D,ResNet18v2/Relu,ResNet18v2/stage1_maxpool,ResNet18v2/group0/block0/conv1/Conv2D,ResNet18v2/Relu_1,ResNet18v2/group0/block0/conv2/Conv2D,ResNet18v2/stage2_add1,ResNet18v2/stage2_relu1,ResNet18v2/group0/block1/conv1/Conv2D,ResNet18v2/Relu_2,ResNet18v2/group0/block1/conv2/Conv2D,ResNet18v2/stage2_add2,ResNet18v2/stage2_relu2,ResNet18v2/group1/block0/convshortcut/Conv2D,ResNet18v2/group1/block0/conv1/Conv2D,ResNet18v2/Relu_3,ResNet18v2/group1/block0/conv2/Conv2D,ResNet18v2/stage3_add1,ResNet18v2/stage3_relu1,ResNet18v2/group1/block1/conv1/Conv2D,ResNet18v2/Relu_4,ResNet18v2/group1/block1/conv2/Conv2D,ResNet18v2/stage3_add2,ResNet18v2/stage3_relu2,ResNet18v2/group2/block0/convshortcut/Conv2D,ResNet18v2/group2/block0/conv1/Conv2D,ResNet18v2/Relu_5,ResNet18v2/group2/block0/conv2/Conv2D,ResNet18v2/stage4_add1,ResNet18v2/stage4_relu1,ResNet18v2/group2/block1/conv1/Conv2D,ResNet18v2/Relu_6,ResNet18v2/group2/block1/conv2/Conv2D,ResNet18v2/stage4_add2,ResNet18v2/stage4_relu2,ResNet18v2/group3/block0/convshortcut/Conv2D,ResNet18v2/group3/block0/conv1/Conv2D,ResNet18v2/Relu_7,ResNet18v2/group3/block0/conv2/Conv2D,ResNet18v2/stage5_add1,ResNet18v2/stage5_relu1,ResNet18v2/group3/block1/conv1/Conv2D,ResNet18v2/Relu_8,ResNet18v2/group3/block1/conv2/Conv2D,ResNet18v2/stage5_add2,ResNet18v2/stage5_relu2,ResNet18v2/conv_extras1_1x1_s1/Conv2D,ResNet18v2/Relu_9,ResNet18v2/conv_extras2_3x3_s1/Conv2D,ResNet18v2/Relu_10,ResNet18v2/conv_extras3_1x1_s1/Conv2D,ResNet18v2/Relu_11,ResNet18v2/conv_extras4_3x3_s2/Conv2D,ResNet18v2/Relu_12,ResNet18v2/conv_ff3_1x1_s1/Conv2D,ResNet18v2/Relu_15,ResNet18v2/conv_ff2_1x1_s1/Conv2D,ResNet18v2/conv_ff2_1x1_s1_bn/FusedBatchNormV3,ResNet18v2/Relu_14,ResNet18v2/conv_ff1_1x1_s1/Conv2D,ResNet18v2/conv_ff1_1x1_s1_bn/FusedBatchNormV3,ResNet18v2/Relu_13} output to be reformatted 1 finish 0.46 0.01 0.0

[02/01/2021-18:55:10] [I] {ResNet18v2/conv_pfe1_3x3_s1/Conv2D,ResNet18v2/Relu_16,ResNet18v2/conv_pfe2_3x3_s2/Conv2D,ResNet18v2/Relu_17,ResNet18v2/conv_pfe3_3x3_s2/Conv2D,ResNet18v2/Relu_18,ResNet18v2/conv_pfe4_3x3_s2/Conv2D,ResNet18v2/Relu_19,ResNet18v2/conv_pfe5_3x3_s2/Conv2D,ResNet18v2/Relu_20,ResNet18v2/conv_pfe6_3x3_s2/Conv2D,ResNet18v2/Relu_21,ResNet18v2/conv_loc5_3x3_s1/Conv2D,ResNet18v2/conv_conf5_3x3_s1/Conv2D,ResNet18v2/conv_loc4_3x3_s1/Conv2D,ResNet18v2/conv_conf4_3x3_s1/Conv2D,ResNet18v2/conv_loc3_3x3_s1/Conv2D,ResNet18v2/conv_conf3_3x3_s1/Conv2D,ResNet18v2/conv_loc2_3x3_s1/Conv2D,ResNet18v2/conv_conf2_3x3_s1/Conv2D,ResNet18v2/conv_loc1_3x3_s1/Conv2D,ResNet18v2/conv_conf1_3x3_s1/Conv2D,ResNet18v2/conv_loc0_3x3_s1/Conv2D,ResNet18v2/conv_conf0_3x3_s1/Conv2D} 0.17 0.00 0.0
[02/01/2021-18:55:10] [I] ResNet18v2/feature_transform_module:0 finish 0.05 0.00 0.0
[02/01/2021-18:55:10] [I] {ResNet18v2/conv_pfe1_3x3_s1/Conv2D,ResNet18v2/Relu_16,ResNet18v2/conv_pfe2_3x3_s2/Conv2D,ResNet18v2/Relu_17,ResNet18v2/conv_pfe3_3x3_s2/Conv2D,ResNet18v2/Relu_18,ResNet18v2/conv_pfe4_3x3_s2/Conv2D,ResNet18v2/Relu_19,ResNet18v2/conv_pfe5_3x3_s2/Conv2D,ResNet18v2/Relu_20,ResNet18v2/conv_pfe6_3x3_s2/Conv2D,ResNet18v2/Relu_21,ResNet18v2/conv_loc5_3x3_s1/Conv2D,ResNet18v2/conv_conf5_3x3_s1/Conv2D,ResNet18v2/conv_loc4_3x3_s1/Conv2D,ResNet18v2/conv_conf4_3x3_s1/Conv2D,ResNet18v2/conv_loc3_3x3_s1/Conv2D,ResNet18v2/conv_conf3_3x3_s1/Conv2D,ResNet18v2/conv_loc2_3x3_s1/Conv2D,ResNet18v2/conv_conf2_3x3_s1/Conv2D,ResNet18v2/conv_loc1_3x3_s1/Conv2D,ResNet18v2/conv_conf1_3x3_s1/Conv2D,ResNet18v2/conv_loc0_3x3_s1/Conv2D,ResNet18v2/conv_conf0_3x3_s1/Conv2D} output reformatter 8 553.06 7.00 17.1
[02/01/2021-18:55:10] [I] {ResNet18v2/conv_pfe1_3x3_s1/Conv2D,ResNet18v2/Relu_16,ResNet18v2/conv_pfe2_3x3_s2/Conv2D,ResNet18v2/Relu_17,ResNet18v2/conv_pfe3_3x3_s2/Conv2D,ResNet18v2/Relu_18,ResNet18v2/conv_pfe4_3x3_s2/Conv2D,ResNet18v2/Relu_19,ResNet18v2/conv_pfe5_3x3_s2/Conv2D,ResNet18v2/Relu_20,ResNet18v2/conv_pfe6_3x3_s2/Conv2D,ResNet18v2/Relu_21,ResNet18v2/conv_loc5_3x3_s1/Conv2D,ResNet18v2/conv_conf5_3x3_s1/Conv2D,ResNet18v2/conv_loc4_3x3_s1/Conv2D,ResNet18v2/conv_conf4_3x3_s1/Conv2D,ResNet18v2/conv_loc3_3x3_s1/Conv2D,ResNet18v2/conv_conf3_3x3_s1/Conv2D,ResNet18v2/conv_loc2_3x3_s1/Conv2D,ResNet18v2/conv_conf2_3x3_s1/Conv2D,ResNet18v2/conv_loc1_3x3_s1/Conv2D,ResNet18v2/conv_conf1_3x3_s1/Conv2D,ResNet18v2/conv_loc0_3x3_s1/Conv2D,ResNet18v2/conv_conf0_3x3_s1/Conv2D} output to be reformatted 8 finish 0.42 0.01 0.0
[02/01/2021-18:55:10] [I] {ResNet18v2/conv_pfe1_3x3_s1/Conv2D,ResNet18v2/Relu_16,ResNet18v2/conv_pfe2_3x3_s2/Conv2D,ResNet18v2/Relu_17,ResNet18v2/conv_pfe3_3x3_s2/Conv2D,ResNet18v2/Relu_18,ResNet18v2/conv_pfe4_3x3_s2/Conv2D,ResNet18v2/Relu_19,ResNet18v2/conv_pfe5_3x3_s2/Conv2D,ResNet18v2/Relu_20,ResNet18v2/conv_pfe6_3x3_s2/Conv2D,ResNet18v2/Relu_21,ResNet18v2/conv_loc5_3x3_s1/Conv2D,ResNet18v2/conv_conf5_3x3_s1/Conv2D,ResNet18v2/conv_loc4_3x3_s1/Conv2D,ResNet18v2/conv_conf4_3x3_s1/Conv2D,ResNet18v2/conv_loc3_3x3_s1/Conv2D,ResNet18v2/conv_conf3_3x3_s1/Conv2D,ResNet18v2/conv_loc2_3x3_s1/Conv2D,ResNet18v2/conv_conf2_3x3_s1/Conv2D,ResNet18v2/conv_loc1_3x3_s1/Conv2D,ResNet18v2/conv_conf1_3x3_s1/Conv2D,ResNet18v2/conv_loc0_3x3_s1/Conv2D,ResNet18v2/conv_conf0_3x3_s1/Conv2D} output reformatter 5 0.50 0.01 0.0
[02/01/2021-18:55:10] [I] {ResNet18v2/conv_pfe1_3x3_s1/Conv2D,ResNet18v2/Relu_16,ResNet18v2/conv_pfe2_3x3_s2/Conv2D,ResNet18v2/Relu_17,ResNet18v2/conv_pfe3_3x3_s2/Conv2D,ResNet18v2/Relu_18,ResNet18v2/conv_pfe4_3x3_s2/Conv2D,ResNet18v2/Relu_19,ResNet18v2/conv_pfe5_3x3_s2/Conv2D,ResNet18v2/Relu_20,ResNet18v2/conv_pfe6_3x3_s2/Conv2D,ResNet18v2/Relu_21,ResNet18v2/conv_loc5_3x3_s1/Conv2D,ResNet18v2/conv_conf5_3x3_s1/Conv2D,ResNet18v2/conv_loc4_3x3_s1/Conv2D,ResNet18v2/conv_conf4_3x3_s1/Conv2D,ResNet18v2/conv_loc3_3x3_s1/Conv2D,ResNet18v2/conv_conf3_3x3_s1/Conv2D,ResNet18v2/conv_loc2_3x3_s1/Conv2D,ResNet18v2/conv_conf2_3x3_s1/Conv2D,ResNet18v2/conv_loc1_3x3_s1/Conv2D,ResNet18v2/conv_conf1_3x3_s1/Conv2D,ResNet18v2/conv_loc0_3x3_s1/Conv2D,ResNet18v2/conv_conf0_3x3_s1/Conv2D} output to be reformatted 5 finish 0.17 0.00 0.0
[02/01/2021-18:55:10] [I] {ResNet18v2/conv_pfe1_3x3_s1/Conv2D,ResNet18v2/Relu_16,ResNet18v2/conv_pfe2_3x3_s2/Conv2D,ResNet18v2/Relu_17,ResNet18v2/conv_pfe3_3x3_s2/Conv2D,ResNet18v2/Relu_18,ResNet18v2/conv_pfe4_3x3_s2/Conv2D,ResNet18v2/Relu_19,ResNet18v2/conv_pfe5_3x3_s2/Conv2D,ResNet18v2/Relu_20,ResNet18v2/conv_pfe6_3x3_s2/Conv2D,ResNet18v2/Relu_21,ResNet18v2/conv_loc5_3x3_s1/Conv2D,ResNet18v2/conv_conf5_3x3_s1/Conv2D,ResNet18v2/conv_loc4_3x3_s1/Conv2D,ResNet18v2/conv_conf4_3x3_s1/Conv2D,ResNet18v2/conv_loc3_3x3_s1/Conv2D,ResNet18v2/conv_conf3_3x3_s1/Conv2D,ResNet18v2/conv_loc2_3x3_s1/Conv2D,ResNet18v2/conv_conf2_3x3_s1/Conv2D,ResNet18v2/conv_loc1_3x3_s1/Conv2D,ResNet18v2/conv_conf1_3x3_s1/Conv2D,ResNet18v2/conv_loc0_3x3_s1/Conv2D,ResNet18v2/conv_conf0_3x3_s1/Conv2D} output reformatter 1 0.39 0.00 0.0
[02/01/2021-18:55:10] [I] {ResNet18v2/conv_pfe1_3x3_s1/Conv2D,ResNet18v2/Relu_16,ResNet18v2/conv_pfe2_3x3_s2/Conv2D,ResNet18v2/Relu_17,ResNet18v2/conv_pfe3_3x3_s2/Conv2D,ResNet18v2/Relu_18,ResNet18v2/conv_pfe4_3x3_s2/Conv2D,ResNet18v2/Relu_19,ResNet18v2/conv_pfe5_3x3_s2/Conv2D,ResNet18v2/Relu_20,ResNet18v2/conv_pfe6_3x3_s2/Conv2D,ResNet18v2/Relu_21,ResNet18v2/conv_loc5_3x3_s1/Conv2D,ResNet18v2/conv_conf5_3x3_s1/Conv2D,ResNet18v2/conv_loc4_3x3_s1/Conv2D,ResNet18v2/conv_conf4_3x3_s1/Conv2D,ResNet18v2/conv_loc3_3x3_s1/Conv2D,ResNet18v2/conv_conf3_3x3_s1/Conv2D,ResNet18v2/conv_loc2_3x3_s1/Conv2D,ResNet18v2/conv_conf2_3x3_s1/Conv2D,ResNet18v2/conv_loc1_3x3_s1/Conv2D,ResNet18v2/conv_conf1_3x3_s1/Conv2D,ResNet18v2/conv_loc0_3x3_s1/Conv2D,ResNet18v2/conv_conf0_3x3_s1/Conv2D} output to be reformatted 1 finish 0.17 0.00 0.0


thanks
Eyal

Hi,

These are Reformat layers, used for data type and format conversion.
You can find the TensorRT documentation on how to use reformat-free inference:
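For example, a minimal sketch of a reformat-free I/O setup (my illustration, not taken from the sample; assuming the TensorRT 7.x C++ builder API, with kCHW16 used only as an example of a vectorized FP16 format):

#include "NvInfer.h"

// Sketch: restrict the network I/O tensors to one explicit type/format so the
// builder does not insert reformat layers at the network boundary.
void setReformatFreeIO(nvinfer1::INetworkDefinition* network,
                       nvinfer1::IBuilderConfig* config)
{
    // Honor the requested types/formats instead of silently converting.
    config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);

    nvinfer1::ITensor* input = network->getInput(0);
    input->setType(nvinfer1::DataType::kHALF);
    input->setAllowedFormats(
        1U << static_cast<uint32_t>(nvinfer1::TensorFormat::kCHW16));

    for (int i = 0; i < network->getNbOutputs(); i++)
    {
        nvinfer1::ITensor* output = network->getOutput(i);
        output->setType(nvinfer1::DataType::kHALF);
        output->setAllowedFormats(
            1U << static_cast<uint32_t>(nvinfer1::TensorFormat::kCHW16));
    }
}

The application is then responsible for providing the input/output buffers already in the requested type and format.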

Thanks.

Hi @AastaLLL,
I just stumbled on this sample this morning myself :)
A couple of questions though please:

  • So basically 94% of my network’s time is spent on reformatting rather than on the actual layer computation?
  • All because of ASIL??
  • Is this unique to working with the DLA? It’s not happening on the GPU?
  • I don’t understand something… my network input is FP32, but when I compile to .plan I use --fp16.
    So internally TensorRT changes it to FP16… how will that work with reformat-free inference?
    How will that work if I use the INT8 dynamic range? (See my rough sketch after this list.)
  • Does the sample code only run on the GPU and not on the DLA?
  • And finally, suppose it all works correctly with reformat-free inference, should I expect a massive performance gain on the DLA here?
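Here is the rough sketch mentioned above of how I imagine the binding side would work (my own guess, assuming the TensorRT 7.x C++ API): if the input tensor is forced to FP16 for reformat-free inference, the application would have to convert the FP32 host data itself before copying it in.

#include "NvInfer.h"
#include <cuda_fp16.h>
#include <vector>

// Sketch: when the engine's input binding is kHALF (because the input tensor
// type was set for reformat-free inference), the application must feed FP16
// data itself instead of relying on an FP32->FP16 reformat layer.
void fillHalfInput(const nvinfer1::ICudaEngine& engine,
                   const std::vector<float>& src,
                   std::vector<__half>& dst)
{
    const int inputIndex = 0;  // assumes binding 0 is the network input
    if (engine.getBindingDataType(inputIndex) == nvinfer1::DataType::kHALF)
    {
        dst.resize(src.size());
        for (size_t i = 0; i < src.size(); i++)
            dst[i] = __float2half(src[i]);  // host-side FP32 -> FP16
    }
}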

thanks
Eyal

Hi @AastaLLL ,
Another follow-up question, if I may.

I’d like to run my network on the DLA; it includes a bilinear resize, which falls back to the GPU.
I have this code in place, but I still see the reformats… any idea what could be wrong?

config->setFlag(nvinfer1::BuilderFlag::kFP16);
config->setFlag(nvinfer1::BuilderFlag::kGPU_FALLBACK);
config->setDefaultDeviceType(nvinfer1::DeviceType::kDLA);
config->setDLACore(0);
config->setFlag(nvinfer1::BuilderFlag::kSTRICT_TYPES);
network->getInput(0)->setAllowedFormats(
    static_cast<nvinfer1::TensorFormats>(1U << static_cast<int>(nvinfer1::TensorFormat::kLINEAR)));
for (int i = 0; i < nbOutputs; i++)
{
    network->getOutput(i)->setAllowedFormats(
        static_cast<nvinfer1::TensorFormats>(1U << static_cast<int>(nvinfer1::TensorFormat::kLINEAR)));
}

engine = builder->buildEngineWithConfig(*network, *config);
context = engine->createExecutionContext();


Thanks
Eyal

Hi @AastaLLL ,
I’ve been playing with the network, cutting it at various places and evaluating the performance as the network grows…
I believe the reformat layers are not the reason for the 77.3% overhead in the dump I provided in the first post.
Here’s the profile output for the first/simple portion of the network (after cutting it). I doubt that 91.9% of the time is really spent in the “ResNet18v2/Relu:0 from nvm” layer… isn’t it more likely that the timing information is not attributed correctly?

=== Profile (100 iterations ) ===
Layer Time (ms) Avg. Time (ms) Time %
(Unnamed Layer* 3) [Constant] + (Unnamed Layer* 4) [Shuffle] 0.67 0.01 0.1
(Unnamed Layer* 6) [Constant] + (Unnamed Layer* 7) [Shuffle] 0.37 0.00 0.0
{scale_operand_of_ResNet18v2/image_preprocess/sub} input reformatter 0 5.12 0.05 0.4
{scale_operand_of_ResNet18v2/image_preprocess/sub} 30.03 0.30 2.5
{scale_operand_of_ResNet18v2/image_preprocess/sub} reformatted input 0 finish 5.76 0.06 0.5
{scale_operand_of_ResNet18v2/image_preprocess/sub} output reformatter 0 10.44 0.10 0.9
{scale_operand_of_ResNet18v2/image_preprocess/sub} output to be reformatted 0 finish 2.85 0.03 0.2
PWN(PWN((Unnamed Layer* 0) [Constant] + (Unnamed Layer* 1) [Shuffle] + ResNet18v2/image_preprocess/mul, ResNet18v2/image_preprocess/sub), ResNet18v2/image_preprocess/truediv) input reformatter 0 7.79 0.08 0.6
PWN(PWN((Unnamed Layer* 0) [Constant] + (Unnamed Layer* 1) [Shuffle] + ResNet18v2/image_preprocess/mul, ResNet18v2/image_preprocess/sub), ResNet18v2/image_preprocess/truediv) input reformatter 2 0.58 0.01 0.0
PWN(PWN((Unnamed Layer* 0) [Constant] + (Unnamed Layer* 1) [Shuffle] + ResNet18v2/image_preprocess/mul, ResNet18v2/image_preprocess/sub), ResNet18v2/image_preprocess/truediv) 6.29 0.06 0.5
ResNet18v2/conv0/Conv2D__27 input reformatter 0 3.57 0.04 0.3
ResNet18v2/conv0/Conv2D__27 6.84 0.07 0.6
ResNet18v2/conv0/Conv2D__27 output reformatter 0 10.21 0.10 0.8
{ResNet18v2/conv0/Conv2D,ResNet18v2/Relu} 4.90 0.05 0.4
ResNet18v2/conv0/Conv2D__27:0 finish 1.97 0.02 0.2
ResNet18v2/Relu:0 from nvm 1113.93 11.14 91.9
ResNet18v2/Relu:0 copy finish 0.38 0.00 0.0
Total 1211.70 12.12 100.0

Hi,

Sorry, my previous statement was not clear enough.

The “output reformatter XX” entry indicates a Reformat layer.
But please note that TensorRT combines and merges layers before submitting them to the device.
So the “output reformatter” you see may cover several layers that have been fused together.

For example:

{ResNet18v2/conv0/Conv2D,ResNet18v2/Relu,ResNet18v2/stage1_maxpool,ResNet18v2/group0/block0/conv1/Conv2D,ResNet18v2/Relu_1,ResNet18v2/group0/block0/conv2/Conv2D,ResNet18v2/stage2_add1,ResNet18v2/stage2_relu1,ResNet18v2/group0/block1/conv1/Conv2D,ResNet18v2/Relu_2,ResNet18v2/group0/block1/conv2/Conv2D,ResNet18v2/stage2_add2,ResNet18v2/stage2_relu2,ResNet18v2/group1/block0/convshortcut/Conv2D,ResNet18v2/group1/block0/conv1/Conv2D,ResNet18v2/Relu_3,ResNet18v2/group1/block0/conv2/Conv2D,ResNet18v2/stage3_add1,ResNet18v2/stage3_relu1,ResNet18v2/group1/block1/conv1/Conv2D,ResNet18v2/Relu_4,ResNet18v2/group1/block1/conv2/Conv2D,ResNet18v2/stage3_add2,ResNet18v2/stage3_relu2,ResNet18v2/group2/block0/convshortcut/Conv2D,ResNet18v2/group2/block0/conv1/Conv2D,ResNet18v2/Relu_5,ResNet18v2/group2/block0/conv2/Conv2D,ResNet18v2/stage4_add1,ResNet18v2/stage4_relu1,ResNet18v2/group2/block1/conv1/Conv2D,ResNet18v2/Relu_6,ResNet18v2/group2/block1/conv2/Conv2D,ResNet18v2/stage4_add2,ResNet18v2/stage4_relu2,ResNet18v2/group3/block0/convshortcut/Conv2D,ResNet18v2/group3/block0/conv1/Conv2D,ResNet18v2/Relu_7,ResNet18v2/group3/block0/conv2/Conv2D,ResNet18v2/stage5_add1,ResNet18v2/stage5_relu1,ResNet18v2/group3/block1/conv1/Conv2D,ResNet18v2/Relu_8,ResNet18v2/group3/block1/conv2/Conv2D,ResNet18v2/stage5_add2,ResNet18v2/stage5_relu2,ResNet18v2/conv_extras1_1x1_s1/Conv2D,ResNet18v2/Relu_9,ResNet18v2/conv_extras2_3x3_s1/Conv2D,ResNet18v2/Relu_10,ResNet18v2/conv_extras3_1x1_s1/Conv2D,ResNet18v2/Relu_11,ResNet18v2/conv_extras4_3x3_s2/Conv2D,ResNet18v2/Relu_12,ResNet18v2/conv_ff3_1x1_s1/Conv2D,ResNet18v2/Relu_15,ResNet18v2/conv_ff2_1x1_s1/Conv2D,ResNet18v2/conv_ff2_1x1_s1_bn/FusedBatchNormV3,ResNet18v2/Relu_14,ResNet18v2/conv_ff1_1x1_s1/Conv2D,ResNet18v2/conv_ff1_1x1_s1_bn/FusedBatchNormV3,ResNet18v2/Relu_13} output reformatter

Thanks.

Hi @AastaLLL ,
Yes, that confirms what I see. So in that case the profiler cannot show me which of the merged layers is responsible for most of the time, so that I might be able to optimize it, right?
Any other way to get this information?

thanks
Eyal

Hi,

For layer-level profiling, you can use TensorRT’s own profiler (IProfiler):
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-713/api/c_api/classnvinfer1_1_1_i_profiler.html
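A minimal sketch of plugging in a custom profiler (my illustration, assuming the TensorRT 7.x C++ API; per-layer times are reported on the synchronous execute path):

#include "NvInfer.h"
#include <cstdio>

// Sketch: print per-layer timings. For DLA subgraphs the reported "layer" is
// still the whole fused node, just like in the trtexec output.
struct SimpleProfiler : public nvinfer1::IProfiler
{
    void reportLayerTime(const char* layerName, float ms) override
    {
        std::printf("%-60s %8.3f ms\n", layerName, ms);
    }
};

// Usage:
//   SimpleProfiler profiler;
//   context->setProfiler(&profiler);
//   context->executeV2(bindings);  // synchronous execution triggers the reports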

Thanks.

Thanks @AastaLLL. I believe the performance log in the first post of this thread (from trtexec) already uses TRT’s profiler, and for the DLA it joins the layers in such a way that there’s no way to know which layer is the most time-consuming.

thanks
Eyal