Provide details on the platforms you are using:

Target Device: NVIDIA Jetson TX2 Module (P3310)
OS: Ubuntu 18.04.2 LTS
Kernel: 4.9.140-tegra

Development Platform:
Linux distro and version: Ubuntu 16.04
GPU: NVIDIA RTX 2070
NVIDIA driver: 418.88
CUDA: 10.0
cuDNN: 7
Python: 3.5
tensorflow-gpu: 1.14.0
TensorRT: 5.1.5.0
UFF version: 0.6.3
Hi all,
We have ported a TensorFlow .pb model to TensorRT and run it on a Jetson TX2 (P3310).
However, inference is slow despite the model being relatively small. Model Architecture Here
In a unit test (without profiling) in which our model runs inference on a single 64x64 image on a single stream, inference takes ~3 ms.
In comparison, a ResNet18 model takes ~5 ms for inference on a single 224x224 image.
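For context, the unit-test timing is essentially wall-clock time around a single synchronous TensorRT execute call, roughly like this sketch (engine creation and buffer setup omitted; variable names are placeholders):

#include <chrono>
// context: IExecutionContext*, bindings: array of device buffer pointers
auto t0 = std::chrono::high_resolution_clock::now();
context->execute(/*batchSize=*/1, bindings);   // synchronous single-image inference
auto t1 = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();   // ~3 ms for our model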
We expected the inference on our model to be much faster for the following reasons:
Both models are converted to FP16 and have pooling layers between convolutional layers
Our model is also much smaller, with only 6 convolutional layers, whereas ResNet18 has at least 17 convolutional layers
The input size for our model is 64x64, which is 12.25 times smaller (in pixel count) than the 224x224 input of the ResNet18 model.
Despite these size advantages, our model's inference time (~3 ms) is only modestly lower than ResNet18's (~5 ms), nowhere near the speedup we expected.
We tried to identify the issue in our model using the NVIDIA Visual Profiler, importing .prof files generated with the following command:
nvprof -o timeline.prof ./model/unit_test
In the Visual Profiler, the timeline shows that the graph computations for our model occur much more sparsely than those of a ResNet18 model. Click for comparison and individual timelines
We think the unused time between computations is the main culprit for our model's increased inference time. However, we're not sure what could be causing the sparse computations, and any help would be greatly appreciated!
While this thread was moved to the TX2 forums, we'd like to point out that the computations still occur sparsely when the model is run on the host, although not as severely.
Here is our comparison between the profiler timelines for the same model performing inference on a single image but on different devices.
The images of the profiler timelines have also been attached to this reply.
Thanks for your reply. Here is the information you requested:
We conducted our experiments with sudo jetson_clocks but not with sudo nvpmodel -m 0.
Here is the timeline from applying both commands prior to profiling (also available as an attachment). It seems the issue persists despite maximizing device performance.
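For completeness, the exact commands applied before this run were:

sudo nvpmodel -m 0      # select the MAX-N power mode
sudo jetson_clocks      # lock clocks to their maximum values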
We are using pure TensorRT.
All layers were converted from the .uff to TensorRT FP16 without any issues. Our unit tests also show that the model's output in TensorRT is correct.
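For reference, the engine is built with the standard TensorRT 5.x UFF workflow, roughly as in the sketch below (the tensor names, file name, and workspace size are placeholders, not our exact values):

// Minimal sketch: build an FP16 TensorRT engine from a UFF file (TensorRT 5.x C++ API)
#include "NvInfer.h"
#include "NvUffParser.h"
using namespace nvinfer1;
using namespace nvuffparser;

ICudaEngine* buildEngine(ILogger& logger)
{
    IBuilder* builder = createInferBuilder(logger);
    INetworkDefinition* network = builder->createNetwork();
    IUffParser* parser = createUffParser();
    parser->registerInput("input", Dims3(3, 64, 64), UffInputOrder::kNCHW);  // placeholder name/shape
    parser->registerOutput("output");                                        // placeholder name
    parser->parse("model.uff", *network, DataType::kFLOAT);                  // placeholder file name
    builder->setMaxBatchSize(16);
    builder->setMaxWorkspaceSize(1 << 28);   // placeholder: 256 MB workspace
    builder->setFp16Mode(true);              // request FP16 kernels
    return builder->buildCudaEngine(*network);
}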
We also conducted additional testing in which we doubled our kernel sizes. The computations for each kernel take longer (which makes sense), decreasing the gaps between computations. However, what is peculiar is that the total inference time remains about the same, which perhaps implies that the bottleneck is kernel launch latency. What we would really like is for the kernels to execute back to back.
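As a rough sanity check on the launch-latency hypothesis, a standalone micro-benchmark like the sketch below (purely illustrative, not our model code) can estimate the per-launch overhead on the TX2 by timing back-to-back launches of an empty kernel:

// launch_overhead.cu: time N back-to-back launches of an empty kernel
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int N = 1000;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    emptyKernel<<<1, 1>>>();               // warm-up launch
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < N; ++i)
        emptyKernel<<<1, 1>>>();           // no real work, so this is mostly launch overhead
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average per-launch overhead: %.3f us\n", 1000.0f * ms / N);
    return 0;
}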
Thanks for your experiment.
It looks like the issue comes from the activation_N/Tanh layer.
For the other operations, the TX2 runs around 2x slower than the host, but it is about 7x slower on the Tanh operation.
We are checking this issue with our internal team and will share more information with you later.
The activation_N/Tanh layers certainly are contributing to the inference time.
However, we don't think they can fully account for the gaps in kernel activity that we mentioned in our original post, which do not show up in the trtexec output.
(post quoted below)
If the activation_N/Tanh layers were indeed the sole culprit, we believe there should be a roughly proportional increase in inference time when the kernel size is increased. However, in one of the experiments we shared here, increasing the kernel size did not yield a proportional increase in inference time.
(post quoted below)
This perhaps indicates there is another factor contributing to the gaps between computations.
Thanks for taking the time to work on this issue.
If you need any additional information or data, do let us know! :-)
We did a bit of further experimentation, this time varying the inference batch size when running on the TX2.
While some of the gaps between kernels are filled up, it seems there’s still a significant amount of inactivity between kernels.
Also note that our model's inference time remains around 8 ms for batch sizes 1, 2, 4, 8, and 16.
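(The only thing that changes between these runs is the batch size passed to the synchronous execute call, i.e. roughly:)

for (int batchSize : {1, 2, 4, 8, 16})
    context->execute(batchSize, bindings);   // same engine, built with setMaxBatchSize(16); timed as before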
Sorry for popping into the discussion here. I am also facing a similar issue to the one Evan described, but with a different model. Hope you can help us with this issue too.
The batch size experiment is helpful.
When you build a TensorRT engine from a .uff file, there is a parameter that specifies the maximum batch size.
May I know the value you set?
builder->setMaxBatchSize(maxBatchSize);
By the way, we have also seen similar TensorRT performance issues reported by other users.
Do you run any other CUDA tasks at the same time?
Also, could you try running trtexec with the --useSpinWait option for us?
What is the maximum batch size set?
The value of maxBatchSize in the following line of code is 16:
builder->setMaxBatchSize(maxBatchSize);
Are there any other CUDA processes being run at the same time as the inference?
No, other than the inference, we are not running any other CUDA processes.
Output of trtexec with the --useSpinWait option
We ran trtexec on the TX2 with the --useSpinWait option for both SSRNet (the model that exhibits the problem) and ResNet18. The text-file outputs for both have been attached to the post.
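For reference, the invocations looked roughly like the following (the .uff file names, tensor names, and input shapes here are placeholders):

trtexec --uff=ssrnet.uff --uffInput=input,3,64,64 --output=output --fp16 --useSpinWait
trtexec --uff=resnet18.uff --uffInput=input,3,224,224 --output=output --fp16 --useSpinWait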
Thanks for your patience. We have clarified this issue now.
Actually, this is not an issue.
Please note that there are two steps when running TensorRT from a .uff model:
1. Build the .uff file into a TensorRT engine.
2. Run inference with the TensorRT engine.
The gaps you observed between kernels occur during the engine-building step.
In this step, TensorRT measures the runtime of each candidate kernel and chooses a fast one, so large gaps between kernels are expected.
To profile the TensorRT inference time, please use the profiling data close to the end of the timeline.
There, the maximum gap we can see is only about 0.05 ms.
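If you want to profile only the inference step, one common approach (sketched below; the file name and gLogger are placeholders) is to build and serialize the engine in a separate run, then profile a run that only deserializes and executes it:

// requires NvInfer.h, <fstream>, <vector>, <iterator>

// Build step (kernel auto-tuning happens here; do not profile this run)
IHostMemory* plan = engine->serialize();
std::ofstream out("model.engine", std::ios::binary);
out.write(static_cast<const char*>(plan->data()), plan->size());
plan->destroy();

// Inference step (profile this run only)
IRuntime* runtime = createInferRuntime(gLogger);
std::ifstream in("model.engine", std::ios::binary);
std::vector<char> blob((std::istreambuf_iterator<char>(in)), std::istreambuf_iterator<char>());
ICudaEngine* trtEngine = runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
IExecutionContext* context = trtEngine->createExecutionContext();
context->execute(1, bindings);   // only these kernels reflect real inference time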