Provide details on the platforms you are using:

Target Device: NVIDIA Jetson TX2 Module (P3310)
- Ubuntu 18.04.2 LTS

Development Platform:
- Linux distro and version: Ubuntu 16.04
- GPU: NVIDIA RTX 2070
- NVIDIA driver: 418.88
- UFF version: 0.6.3
We have ported a TensorFlow .pb model to TensorRT and have run it on a Jetson TX2 (P3310). However, the model's inference speed is slow despite the model being relatively small. Model architecture: here.
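For context, the port followed the standard UFF path. A minimal sketch of what that looks like in the TensorRT C++ API (tensor names, channel count, and file names are placeholders, not our exact code):

```cpp
#include <cstdio>
#include "NvInfer.h"
#include "NvUffParser.h"

using namespace nvinfer1;

// Minimal logger required by the TensorRT builder.
class Logger : public ILogger {
    void log(Severity severity, const char* msg) override {
        if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
    }
} gLogger;

int main() {
    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    nvuffparser::IUffParser* parser = nvuffparser::createUffParser();

    // "input"/"output" and 3x64x64 are assumed placeholders.
    parser->registerInput("input", Dims3(3, 64, 64),
                          nvuffparser::UffInputOrder::kNCHW);
    parser->registerOutput("output");
    if (!parser->parse("model.uff", *network, DataType::kFLOAT))
        return 1;

    builder->setFp16Mode(true);  // build FP16 kernels (TX2 supports fast FP16)
    ICudaEngine* engine = builder->buildCudaEngine(*network);
    // ... serialize the engine or run inference with it ...
    return engine ? 0 : 1;
}
```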
In a unit test (without profiling) where the model performs inference on a single 64x64 image on a single stream, an inference takes ~3 ms. In comparison, a ResNet18 model takes ~5 ms for inference on a single 224x224 image.
We expected inference on our model to be much faster, for the following reasons:
- Both models are converted to FP16 and have pooling layers in between convolutional layers.
- Our model is much smaller, having only 6 convolutional layers, whereas ResNet18 has at least 17.
- Our model's input is 64x64, which is 12.25 times fewer pixels than the 224x224 input that ResNet18 takes in.

Despite these size advantages, our model's inference time is only about half that of ResNet18, far short of the speedup we expected.
We tried to identify the issue in our model using the NVIDIA Visual Profiler, importing .prof files generated on the device with nvprof.
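A typical invocation for capturing such an importable timeline looks like this (file names, tensor names, and trtexec flags here are our assumptions, not necessarily the exact command used):

```
nvprof -o ssrnet_fp16.prof ./trtexec --uff=ssrnet.uff --uffInput=input,3,64,64 --output=output --fp16 --batch=1
```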
We think that the idle time between computations is the main culprit for the increased inference time of our model. However, we're not sure what could be causing the sparse kernel activity, and any help rendered would be greatly appreciated!
All layers converted from .uff to TensorRT FP16 without any issues, and our unit tests show that the model's output in TensorRT is correct.
We also conducted additional testing in which we doubled our kernel sizes. The computation for each kernel takes longer (which makes sense), decreasing the gaps between computations. What is peculiar, however, is that the total inference time remains about the same, which perhaps implies that the bottleneck is kernel launch latency. What we'd really want is for the kernels to execute back to back.
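As a sanity check on that hypothesis, a small CUDA micro-benchmark (an illustrative sketch of ours, not part of the experiments above) can estimate the per-launch overhead on the TX2 by timing repeated launches of an empty kernel:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// An empty kernel: all time attributed to it is pure launch overhead.
__global__ void noop() {}

int main() {
    const int N = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    noop<<<1, 1>>>();               // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < N; ++i)
        noop<<<1, 1>>>();           // back-to-back launches on one stream
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("~%.1f us per kernel launch\n", 1000.0f * ms / N);
    return 0;
}
```

If the per-launch cost measured this way is comparable to the gaps in the profiler timeline, launch overhead is a plausible explanation: even a few tens of microseconds per launch, multiplied over every kernel in the network, adds up quickly relative to a ~3-8 ms inference.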
The activation_N/Tanh layers are certainly contributing to the inference time.
However, we don't think they can fully account for the gaps in kernel activity that we mentioned in our original post, which trtexec does not show.
(post quoted below)
If the activation_N/Tanh layers were indeed the sole culprit, we believe there should be a proportional increase in inference time when the kernel size is increased. However, in one of the experiments we shared here, an increase in kernel size did not yield a proportional increase in inference time.
(post quoted below)
This perhaps indicates there is another factor contributing to the gaps between computations.
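One simple way to frame this (an illustrative model of ours, not something taken from the measurements above): if an inference launches N kernels back to back, the total time is roughly N * (t_launch + t_compute). When t_launch dominates, even doubling t_compute, e.g. by doubling the kernel sizes, barely moves the total. That is consistent with what we observed.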
Thanks for taking the time to work on this issue.
If you need any additional information or data, do let us know! :-)
We did a bit of further experimentation, this time varying the inference batch size when running on the TX2.
While some of the gaps between kernels are filled in, there still seems to be a significant amount of inactivity between kernels.
Also note that the inference times for our model remain around 8 ms for batch sizes 1, 2, 4, 8, and 16.
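For reference, the sweep was done along these lines; this is an illustrative sketch that assumes an already-built engine and pre-allocated device buffers (variable names are ours):

```cpp
#include <chrono>
#include <cstdio>
#include "NvInfer.h"

// Times one synchronous inference per batch size on a single stream.
// Assumes the engine was built with maxBatchSize >= 16 and that
// d_input / d_output are device buffers sized for the largest batch.
void sweepBatchSizes(nvinfer1::ICudaEngine& engine,
                     void* d_input, void* d_output) {
    nvinfer1::IExecutionContext* context = engine.createExecutionContext();
    void* bindings[] = { d_input, d_output };
    for (int batch : {1, 2, 4, 8, 16}) {
        auto t0 = std::chrono::steady_clock::now();
        context->execute(batch, bindings);   // implicit-batch, blocking call
        auto t1 = std::chrono::steady_clock::now();
        std::printf("batch %2d: %.2f ms\n", batch,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    context->destroy();
}
```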
What is the maximal batch size set?
The value of maxBatchSize in the following line of code is 16:
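That setting is made through the standard TensorRT builder call, shown here schematically (the exact surrounding code is ours):

```cpp
builder->setMaxBatchSize(16);   // engine will support batch sizes up to 16
```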
Are there any other CUDA processes being run at the same time as the inference?
No, other than the inference, we are not running any other CUDA processes.
Output of TensorRT (trtexec) with the --useSpinWait option
We ran trtexec on the TX2 with the --useSpinWait option for both SSRNet (the model that exhibits the problem) and ResNet. The text-file outputs for these are attached to the post.
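The invocations were along these lines (model file, tensor names, and input dimensions are placeholders, not necessarily our exact flags):

```
./trtexec --uff=ssrnet.uff --uffInput=input,3,64,64 --output=output --fp16 --batch=1 --useSpinWait
```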