Slow inferencing on TensorRT with gaps in between processes

Provide details on the platforms you are using:
Target Device:
NVIDIA Jetson TX2 Module (P3310)
Ubuntu 18.04.2 LTS

Development Platform:
Linux distro and version: Ubuntu 16.04.5 LTS
GPU: Nvidia GTX1060
nvidia driver 410.78
CUDA 10.0
Python 3.5.2
Tensorflow-gpu 1.14.0
UFF Version 0.6.3

Hello. We converted our TF model to TensorRT using pure TensorRT and observed that its inference time is slow compared to the larger ResNet18 model. We ran a unit test to compare the two models:

Our model (Refinement net) :
(3 convolutions with PReLU activation, 3 Dense layers)
Input size : 24 x 24
Batch size : 1
Inference time : 3.244ms

ResNet18 :
Input size : 224 x 224
Batch size : 1
Inference time : ~5ms

Since each side of our input is about 9 times smaller than ResNet18’s (24 vs. 224 pixels, or roughly 87 times fewer pixels), we expected a much shorter inference time. However, our model takes more than half as long as ResNet18 (3.244ms vs. ~5ms). This doesn’t seem right to us.
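A quick check of the numbers behind this expectation, as a minimal Python sketch. It uses only the figures quoted above (taking "~5ms" as 5.0 ms) and compares input resolution only. It ignores channel counts and layer depth, so it is not a FLOP comparison.

```python
# Figures quoted in the post above.
ours_side, resnet_side = 24, 224   # input height/width in pixels
ours_ms, resnet_ms = 3.244, 5.0    # batch-1 inference time; 5.0 stands in for "~5ms"

side_ratio = resnet_side / ours_side   # how much longer each side of the ResNet18 input is
pixel_ratio = side_ratio ** 2          # how many more pixels per channel ResNet18 sees
time_ratio = ours_ms / resnet_ms       # our inference time as a fraction of ResNet18's

print(f"side ratio:  {side_ratio:.1f}x")    # 9.3x
print(f"pixel ratio: {pixel_ratio:.1f}x")   # 87.1x
print(f"time ratio:  {time_ratio:.2f}")     # 0.65
```

So the input is about 87 times smaller in pixel count, yet the inference time is only about 35% shorter.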

We then went a step further and profiled the two models with nvprof. In our model’s timeline there are gaps between kernels during which the GPU is not being utilized: each kernel seems to stall before the next one kicks in.

sudo nvprof -f -o ./model/unit_test

The profiling results show large gaps between kernel executions for our model. There was no such occurrence for ResNet18. We suspect that these gaps cause the longer inference time, but we have no explanation for them.

[Figure: nvprof result of ResNet18 with image size 224x224]
[Figure: nvprof result of our model with image size 24x24]
[Figure: Comparison of nvprof on ResNet18 and our model]
[Figure: Our model’s illustration]
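To put a number on such gaps, one option is to take the kernel start times and durations from a profiler timeline and compute the idle time between consecutive kernels. The sketch below assumes a single stream with non-overlapping kernels. The helper names and the trace values are hypothetical and are not taken from the profiles in this thread.

```python
def idle_gaps(kernels):
    """Idle time (us) between consecutive kernels.

    `kernels` is a list of (start_us, duration_us) tuples sorted by start time.
    """
    gaps = []
    for (start, dur), (next_start, _) in zip(kernels, kernels[1:]):
        # Gap = next kernel's start minus this kernel's end (never negative).
        gaps.append(max(0.0, next_start - (start + dur)))
    return gaps


def gpu_utilization(kernels):
    """Fraction of the span from first start to last end spent running kernels."""
    busy = sum(dur for _, dur in kernels)
    span = (kernels[-1][0] + kernels[-1][1]) - kernels[0][0]
    return busy / span


# Hypothetical trace for illustration only: (start_us, duration_us).
trace = [(0.0, 21.0), (100.0, 15.0), (210.0, 10.0)]
print(idle_gaps(trace))                   # [79.0, 95.0]
print(f"{gpu_utilization(trace):.1%}")    # 20.9%
```

A low utilization figure with short kernels and long gaps would match the pattern described above.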

Then, we increased the batch size from 1 to 16 and observed that the gaps are still there, but the time taken by the convolution layer went from 21us to 65us (about 3 times longer).

[Figure: Comparison of nvprof on our model for batch size 1 and 16]
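Looking at the same two measurements per sample instead of per kernel gives a different picture. This is a minimal sketch using only the 21us and 65us figures quoted above; the variable names are illustrative.

```python
# Convolution kernel time from the post: batch size -> time in microseconds.
conv_us = {1: 21.0, 16: 65.0}

# Time spent per input sample at each batch size.
per_sample_us = {batch: t / batch for batch, t in conv_us.items()}

kernel_slowdown = conv_us[16] / conv_us[1]                   # whole-kernel time ratio
per_sample_speedup = per_sample_us[1] / per_sample_us[16]    # per-sample improvement

print(f"kernel time grows by  {kernel_slowdown:.1f}x")
print(f"per-sample time drops {per_sample_speedup:.1f}x")
```

With these figures the kernel takes about 3 times longer for 16 times the work, so each sample costs roughly 4us instead of 21us.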

With the above observations, we have the following questions for our model :

  1. Each kernel stalled for a period of time before executing the next one, creating gaps. Why are there gaps in between kernel processes?

  2. How can we eliminate these gaps?

  3. Is it true that TensorRT only works better with bigger networks and not the smaller ones? Why is it so?

  4. How does TensorRT handle convolution layers? We observed the following examples:

    • for convolution output = 28, trt_maxwell_cudnn_…128x32… was used
    • for convolution output = 64, trt_maxwell_cudnn_…128x64… was used
    • for convolution output = 72, trt_maxwell_cudnn_…128x128… was used

  5. How does a bigger batch size affect processing time in TensorRT? From the observations above, increasing the batch size from 1 to 16 increases the convolution processing time by a factor of 3.
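On question 4, the three observations are at least consistent with one simple reading. The sketch below rests on two assumptions that nothing in this thread confirms: that "convolution output" means the number of output channels, and that the second number in each kernel name is a tile width along that dimension. It then picks the smallest listed width that covers the channel count. The function name is hypothetical, and TensorRT's actual kernel selection may work quite differently.

```python
# Tile widths taken from the kernel names quoted above (128x32, 128x64, 128x128).
# ASSUMPTION: the second number is a tile width along the output-channel dimension.
TILE_WIDTHS = (32, 64, 128)


def smallest_covering_tile(channels, widths=TILE_WIDTHS):
    """Smallest tile width >= channels, or the largest width if none covers it."""
    for width in sorted(widths):
        if width >= channels:
            return width
    return max(widths)


for channels in (28, 64, 72):
    tile = smallest_covering_tile(channels)
    used = channels / tile   # share of the tile width doing useful work
    print(f"output={channels}: 128x{tile} tile, {used:.1%} of the width used")
```

Under this reading, the 72-output layer would fill only a little over half of a 128-wide tile, which would be one reason a small layer does not run proportionally faster.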

Notes :

  • All the above results were obtained by setting nvpmodel to MAXN(0)
  • All layers were converted to TensorRT FP16 from .uff without any error. We verified with a unit test that the results are correct




We are reviewing and will keep you updated.

To help us debug, can you share your model? You can dm me with a link to google drive or dropbox.


Sure. I have sent you my UFF file through dm.

Per engineering, the gaps are caused by TRT CPU overhead, which consists of kernel launches, data transfers, etc. When the GPU workload is big enough, the CPU overhead is hidden behind it. However, in the customer’s case, the input is so small that the GPU finishes each kernel quickly. That means the GPU has to wait for the CPU to prepare the next job. This makes it a CPU-bound issue.
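The explanation above can be illustrated with a toy timeline model. This is only a sketch under simplifying assumptions: a single stream, a fixed CPU cost per launch, and made-up durations. The `simulate` function and all numbers are hypothetical and are not measurements from this thread.

```python
def simulate(kernel_us, launch_us):
    """Toy single-stream timeline.

    The CPU issues launches back to back, each costing `launch_us`.
    A kernel starts once its launch has been issued AND the previous
    kernel has finished. Returns (total_us, gpu_idle_us); the idle time
    includes the wait for the very first launch.
    """
    cpu_time = 0.0   # when the CPU finishes issuing the current launch
    gpu_free = 0.0   # when the GPU finishes the previous kernel
    idle = 0.0
    for duration in kernel_us:
        cpu_time += launch_us
        start = max(cpu_time, gpu_free)
        idle += start - gpu_free       # GPU sat waiting for the CPU
        gpu_free = start + duration
    return gpu_free, idle


# Hypothetical numbers: six kernels, 20 us of CPU work per launch.
print(simulate([5.0] * 6, launch_us=20.0))     # (125.0, 95.0): tiny kernels, GPU mostly idle
print(simulate([200.0] * 6, launch_us=20.0))   # (1220.0, 20.0): big kernels hide the overhead
```

In the first case the total time is set almost entirely by the launch cost, which is the CPU-bound situation described above. In the second case the launches for later kernels are issued while earlier kernels are still running, so no gaps appear after the first launch.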

Sorry for the late reply, and thanks NVES for your help and answer. It helped us confirm the root cause.