Provide details on the platforms you are using:
- NVIDIA Jetson TX2 Module (P3310), Ubuntu 18.04.2 LTS
- Ubuntu 16.04.5 LTS, NVIDIA GTX 1060, NVIDIA driver 410.78
- UFF version 0.6.3
Hello. We converted our TF model to TensorRT using pure TensorRT and observed that its inference time is slow compared to the much larger ResNet18 model. We ran a unit test and compared the two models:
Our model (Refinement net):
(3 convolutions with PReLU activations, 3 dense layers)
Input size: 24 x 24
Batch size: 1
Inference time: 3.244 ms

ResNet18:
Input size: 224 x 224
Batch size: 1
Inference time: ~5 ms
Since our input is roughly 10 times smaller per dimension than ResNet18’s (24 x 24 vs. 224 x 224, i.e. nearly 100 times fewer pixels), we expected a much shorter inference time. Instead, our model’s inference time is more than half of ResNet18’s, which doesn’t seem right to us.
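To put the size difference in numbers (using the input sizes listed above):

```python
# Input-size comparison between the two models, from the figures above.
ours = 24 * 24        # pixels per image for our Refinement net
resnet = 224 * 224    # pixels per image for ResNet18

print(f"ResNet18 processes {resnet / ours:.0f}x more pixels per image")
```

Yet our model takes more than half of ResNet18’s time, which is what prompted the profiling below.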
We then took a step further and profiled the two models. Upon inspection with nvprof, we noticed gaps between kernels in our model during which the GPU is idle: each kernel appears to stall before the next one launches.
sudo nvprof -f -o profile_res.prof ./model/unit_test
From the profiling result, we noticed huge gaps between consecutive kernels in our model. There was no such pattern for ResNet18. We suspect the longer inference time is caused by these gaps, but we have no explanation for them.
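For reference, here is our rough mental model of what such gaps could mean, assuming (this is an assumption on our part, not a measurement) a fixed per-kernel launch overhead that dominates when the kernels themselves are very short:

```python
# Sketch: if every kernel launch carries a fixed overhead h, a network
# of many tiny kernels is dominated by launch gaps, while a network of
# large kernels is dominated by compute. All numbers here are
# illustrative assumptions, not profiler measurements.

def total_time_us(n_kernels, kernel_us, launch_overhead_us):
    """Serialized execution: each kernel waits out the launch gap."""
    return n_kernels * (launch_overhead_us + kernel_us)

# Hypothetical small net: 20 kernels of ~20 us each, 10 us overhead
small = total_time_us(20, 20, 10)
# Hypothetical big net: 60 kernels of ~80 us each, same overhead
big = total_time_us(60, 80, 10)

print(f"small net: {small} us, overhead share {20 * 10 / small:.0%}")
print(f"big net:   {big} us, overhead share {60 * 10 / big:.0%}")
```

Under these made-up numbers the small net spends a third of its time in gaps, the big net only about a tenth, which would match the timelines we see.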
https://imgur.com/EnCUmU7 nvprof result of ResNet18 with image size 224x224
https://imgur.com/z0cdqmX nvprof result of our model with image size 24x24
https://imgur.com/f9yiMVz Comparison of nvprof on ResNet18 and our model
https://imgur.com/ENobQ3b Our model’s illustration
Then, we increased our batch size from 1 to 16 and observed that the gaps are still there, but the time taken by the convolution kernel went from 21 us to 65 us (about 3 times longer).
https://imgur.com/s3g6kwV Comparison of nvprof on our model for batch size 1 and 16
With the above observations, we have the following questions about our model:
Why are there gaps between kernels? Each kernel stalls for a period of time before the next one executes.
How can we eliminate these gaps?
Is it true that TensorRT only performs well on bigger networks and not on smaller ones? If so, why?
How does TensorRT handle convolution layers? We observed the following kernel choices:
- for convolution output = 28, trt_maxwell_cudnn_…128x32… was used
- for convolution output = 64, trt_maxwell_cudnn_…128x64… was used
- for convolution output = 72, trt_maxwell_cudnn_…128x128… was used
How does a bigger batch size affect processing time in TensorRT? From the observations above, increasing the batch size from 1 to 16 increased the convolution processing time by a factor of about 3.
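Interestingly, the per-image cost actually improves with the larger batch. A quick check on our own measurements (21 us at batch 1, 65 us at batch 16, from the nvprof runs above):

```python
# Per-image cost of the convolution kernel at the two batch sizes,
# using the 21 us and 65 us figures from our nvprof measurements.
per_image_b1 = 21 / 1
per_image_b16 = 65 / 16

print(f"batch 1:  {per_image_b1:.2f} us/image")
print(f"batch 16: {per_image_b16:.2f} us/image")
print(f"per-image speedup: {per_image_b1 / per_image_b16:.1f}x")
```

So although the kernel itself runs ~3x longer, each image is processed roughly 5x faster, which is why we would like to understand how TensorRT scales with batch size.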
- All of the above results were obtained with nvpmodel set to MAXN (0)
- All layers were converted from .uff to TensorRT FP16 without any errors; the unit test confirms the results are correct