I am also baffled as to why this layer takes 26 ms.
I've added some prints after the buildCudaEngine call and during inference using the profiler, to shed some light on this issue.
Any suggestions as to why this layer takes so much time? It doesn't exist when the engine is built for the GPU only.
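For context, the per-layer timings below were collected with an IProfiler attached to the execution context, roughly like this minimal sketch (assuming the TensorRT 5/6 C++ API that pairs with buildCudaEngine; LayerProfiler is just an illustrative name):

```cpp
#include "NvInfer.h"
#include <cstdio>

// Prints one line per layer; called back by TensorRT after each synchronous execute().
struct LayerProfiler : public nvinfer1::IProfiler
{
    int index = 0;
    void reportLayerTime(const char* layerName, float ms) override
    {
        std::printf("Layer [%d]: [%s]: %.4fms\n", index++, layerName, ms);
    }
};

// Attach before running inference:
//   LayerProfiler profiler;
//   context->setProfiler(&profiler);
//   context->execute(batchSize, bindings);  // profiling requires the synchronous execute path
```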
Layer [57]: [(Unnamed Layer* 69) [Shuffle] input reformatter 0]: 26.9848ms
Default DLA is enabled but layer (Unnamed Layer* 69) [Shuffle] is not running on DLA, falling back to GPU.
Adding reformat layer: (Unnamed Layer* 69) [Shuffle] reformatted input 0 (3x3_s1/Conv2D_raw_output___5:0) from Half(1,64,4096:16,8192) to Float(1,64,4096,131072)
For layer (Unnamed Layer* 69) [Shuffle] a higher-precision implementation was chosen than was requested because it resulted in faster network performance
68: [(Unnamed Layer* 68) [Convolution]], type: kCONVOLUTION, precision: kFLOAT, inputs: 1, outputs: 1
Convolution: Dims: 3 x 3, getNbOutputMaps: 32, Stride: 1x1, Padding: 1x1, Dilation: 1x1
Input tensor: R/Relu_16:0, kFLOAT, Dims: 3[512, 64, 64]
Output tensor: 3x3_s1/Conv2D_raw_output___5:0, kFLOAT, Dims: 3[32, 64, 64]
69: [(Unnamed Layer* 69) [Shuffle]], type: kSHUFFLE, precision: kFLOAT, inputs: 1, outputs: 1
Shuffle:
First transpose: [1, 2, 0, 3, 4, 5, 6, 1, ]
Second transpose: [0, 1, 2, 3, 4, 5, 6, 7, ]
Reshape:
Input tensor: 3x3_s1/Conv2D_raw_output___5:0, kFLOAT, Dims: 3[32, 64, 64]
Output tensor: 3x3_s1/Conv2D:0, kFLOAT, Dims: 3[64, 64, 32]
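The layer listing above (entries 68 and 69) comes from walking the network definition after the build. A minimal sketch of that kind of dump, again assuming the TensorRT 5/6 C++ API; dumpNetwork is just an illustrative helper name:

```cpp
#include "NvInfer.h"
#include <cstdio>

// Walks the network definition and prints layer type, precision, and output tensor dims.
void dumpNetwork(nvinfer1::INetworkDefinition* network)
{
    for (int i = 0; i < network->getNbLayers(); ++i)
    {
        nvinfer1::ILayer* layer = network->getLayer(i);
        std::printf("%d: [%s], type: %d, precision: %d, inputs: %d, outputs: %d\n",
                    i, layer->getName(),
                    static_cast<int>(layer->getType()),
                    static_cast<int>(layer->getPrecision()),
                    layer->getNbInputs(), layer->getNbOutputs());

        for (int j = 0; j < layer->getNbOutputs(); ++j)
        {
            nvinfer1::ITensor* t = layer->getOutput(j);
            nvinfer1::Dims d = t->getDimensions();
            std::printf("  Output tensor: %s, dims:", t->getName());
            for (int k = 0; k < d.nbDims; ++k)
                std::printf(" %d", d.d[k]);
            std::printf("\n");
        }
    }
}
```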