Low compute utilization of converted TensorFlow model during inference

Provide details on the platforms you are using:
Target Device:
NVIDIA Jetson TX2 Module (P3310)
Ubuntu 18.04.2 LTS
4.9.140-tegra

Development Platform:
Linux distro and version: Ubuntu 16.04
GPU: Nvidia RTX2070
nvidia driver 418.88
CUDA 10.0
CUDNN 7
Python 3.5
Tensorflow-gpu 1.14.0
TensorRT 5.1.5.0
UFF Version 0.6.3

Hi all,

We have ported a TensorFlow .pb model to TensorRT and have run it on a Jetson TX2 (P3310).
However, inference is slow despite the model being relatively small. Model Architecture Here

In a unit test (without profiling) in which our model performs inference on a single 64x64 image on a single stream, the model takes ~3ms.
In comparison, a ResNet18 model takes ~5ms for inference on a single 224x224 image.

We expected the inference on our model to be much faster for the following reasons:

  • Both models are converted to FP16 and have pooling layers between convolutional layers
  • Our model is also much smaller, having only 6 convolutional layers, whereas ResNet18 has at least 17 convolutional layers
  • The input size for our model is 64x64, which is 12.25 times smaller than the 224x224 input of the ResNet18 model.

Despite the size advantages that our model has, it only takes around half the time of a ResNet18 inference (~3ms vs ~5ms), far short of the speedup we expected.

We tried to identify the issue in our model using the NVIDIA Visual Profiler, importing .prof files generated with the following command:

nvprof -o timeline.prof ./model/unit_test

In the visual profiler, the timeline shows that the graph computations for our model occur much more sparsely than those of a ResNet18 model. Click for comparison and individual timelines

We think that the idle time between computations is the main culprit for the increased inference time of our model. However, we’re not sure what could be causing the sparse computations, and any help would be greatly appreciated!

Best Regards,
tkchong




Hello,

While this thread was moved to the TX2 forums, we’d like to point out that the computations still occur sparsely when the model is run on the host, although not as severely.

Here is our comparison between the profiler timelines for the same model performing inference on a single image but on different devices.
The images of the profiler timelines have also been attached to this reply.

Thank you for your time.


Hi,

Thanks for your report. We need some information to give further suggestions.

1. Have you maximized the device performance first?

sudo nvpmodel -m 0
sudo jetson_clocks

2. We assume you are using TF-TRT rather than pure TensorRT. Is that correct?

3. Could you help check whether all the layers inside your model are natively supported by TensorRT?
This information can be found in the log messages, for example with a logger like the sketch below.
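For example, a minimal logger along the lines of the sketch below (an illustration only, assuming the TensorRT 5.x C++ API; the class name ConsoleLogger is a placeholder) prints every message TensorRT emits while parsing and building, so any warnings about unsupported layers show up on the console.

#include <iostream>
#include "NvInfer.h"

// Forward every TensorRT log message to stdout so that warnings about
// unsupported or converted layers are visible during engine creation.
class ConsoleLogger : public nvinfer1::ILogger
{
public:
    void log(Severity, const char* msg) override
    {
        // Print everything, including INFO-level messages, so nothing is filtered out.
        std::cout << msg << std::endl;
    }
};

static ConsoleLogger gLogger;

int main()
{
    // All builder/parser/runtime messages now go through gLogger.
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
    // ... register inputs/outputs, parse the uff file, build the engine ...
    builder->destroy();
    return 0;
}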

Thanks.

Hi,

Thanks for your reply. Here is the information you requested:

  1. We conducted our experimentation with sudo jetson_clocks but not with sudo nvpmodel -m 0
    Here is the timeline from applying both commands prior to profiling (also available as an attachment). It seems the issue still persists despite maximizing device performance.

  2. We are using pure TensorRT.

  3. All layers were converted from the .uff to TensorRT FP16 without any issues. Our unit tests also show that the output of the model in TensorRT is correct.

We also conducted additional testing in which we increased our kernel sizes twofold. The computations for each kernel take longer (which makes sense), decreasing the gaps between computations. However, what is peculiar is that the total inference time remains about the same, which perhaps implies that the bottleneck is kernel launch latency. What we’d really want is for the kernels to execute back to back.


Hi,

It looks weird to me. The layers in your model should run very well with pure TensorRT.

Could you help us do another experiment?
We would like to get a layer-level profiling result from TensorRT.

Here are the steps:

$ cp -r /usr/src/tensorrt/ .
$ cd tensorrt/samples/trtexec

Apply this patch

diff --git a/trtexec.cpp b/trtexec.cpp
index 95d01fb..449e8f4 100644
--- a/trtexec.cpp
+++ b/trtexec.cpp
@@ -143,6 +143,34 @@ float percentile(float percentage, std::vector<float>& times)
     return std::numeric_limits<float>::infinity();
 }
 
+struct Profiler : public IProfiler
+{
+    typedef std::pair<std::string, float> Record;
+    std::vector<Record> mProfile;
+
+    virtual void reportLayerTime(const char* layerName, float ms)
+    {
+        auto record = std::find_if(mProfile.begin(), mProfile.end(), [&](const Record& r){ return r.first == layerName; });
+        if (record == mProfile.end())
+            mProfile.push_back(std::make_pair(layerName, ms));
+        else
+            record->second += ms;
+    }
+
+    void printLayerTimes()
+    {
+        float totalTime = 0;
+        for (size_t i = 0; i < mProfile.size(); i++)
+        {
+            printf("%-40.40s %4.3fms\n", mProfile[i].first.c_str(), mProfile[i].second / gParams.avgRuns);
+            totalTime += mProfile[i].second;
+        }
+        printf("Time over all layers: %4.3f\n", totalTime / gParams.avgRuns);
+    }
+
+} gProfiler;
+
+
 class RndInt8Calibrator : public IInt8EntropyCalibrator2
 {
 public:
@@ -405,6 +433,7 @@ ICudaEngine* onnxToTRTModel()
 void doInference(ICudaEngine& engine)
 {
     IExecutionContext* context = engine.createExecutionContext();
+    context->setProfiler(&gProfiler);
 
     // Use an aliasing shared_ptr since we don't want engine to be deleted when bufferManager goes out of scope.
     std::shared_ptr<ICudaEngine> emptyPtr{};
@@ -438,10 +467,12 @@ void doInference(ICudaEngine& engine)
             times[i] = ms;
             totalGpu += ms;
         }
+        gProfiler.printLayerTimes();
         totalGpu /= gParams.avgRuns;
         totalHost /= gParams.avgRuns;
         gLogInfo << "Average over " << gParams.avgRuns << " runs is " << totalGpu << " ms (host walltime is " << totalHost
                  << " ms, " << static_cast<int>(gParams.pct) << "\% percentile time is " << percentile(gParams.pct, times) << ")." << std::endl;
+
     }
 
     if (gParams.dumpOutput)
$ make
$ cd ../../bin
$ ./trtexec xxxx

Thanks.

Hello, thanks for the reply!

We have run trtexec with our model on both our host system and the TX2.
We ran the executable with our .uff model file and minimal flags, as follows:

./trtexec --uff=/path/to/model.uff --uffInput=input_1,3,64,64 --output=ssr_function/mul_9

Here, input_1 and ssr_function/mul_9 correspond to the input and output tensors of the model.

The results were piped to .txt files and have been attached as the following:

  1. Output on Host: trtexec_ssrnet_host.txt
  2. Output on TX2: trtexec_ssrnet_tx2.txt

We have also provided our model as a .pbtxt file (uploaded as a .txt file because of forum attachment limitations) without loaded weights, for your reference.
trtexec_ssrnet_host.txt (59.2 KB)
trtexec_ssrnet_tx2.txt (59.3 KB)
ssrnet_nchw(pbtxt).txt (150 KB)

Hello, for additional reference, we are attaching the output from running trtexec with a ResNet on both our host system and the TX2.

  1. Output on Host: trtexec_resnetdream_host.txt
  2. Output on TX2: trtexec_resnetdream_TX2.txt

The input command is as follows:

./trtexec --uff=/path/to/resnet/model.uff --uffInput=input_img,3,224,224 --uffInput=input_yaw,1,1,1 --output=dream.add/add

trtexec_resnetdream_host.txt (18.6 KB)
trtexec_resnetdream_tx2.txt (18.7 KB)

Hi,

Thanks for your experiment.
It looks like the issue comes from the activation_N/Tanh layer.

For the other operations, the TX2 runs around 2x slower than the host, but it is 7x slower on the Tanh operation.
We are checking this issue with our internal team and will share more information with you later.

Thanks.

Hi @AastaLLL,

Thanks for replying to us.

The activation_N/Tanh layers certainly are contributing to the inference time.
However, we don’t think they can fully account for the gaps in kernel activity that we mentioned in our original post, which trtexec does not show.
(post quoted below)

If the activation_N/Tanh layers were indeed the sole culprit, we believe there would be a proportional increase in inference time when the kernel size is increased. However, in one of the experiments that we shared here, an increase in kernel size did not yield a proportional increase in inference time.
(post quoted below)

This perhaps indicates there is another factor contributing to the gaps between computations.

Thanks for taking the time to work on this issue.
If you need any additional information or data, do let us know! :-)

Hi @AastaLLL,

We did a bit of further experimentation, this time varying the inference batch size when running on the TX2.
While some of the gaps between kernels are filled up, it seems there’s still a significant amount of inactivity between kernels.
Also note that the inference times for our model for batch sizes 1, 2, 4, 8 and 16 remain around 8ms.

Here’s a comparison between the timelines of the model with batchsize=1 and batchsize=16.
We’ve also provided the .prof files for the model for batchsize={1, 2, 4, 8, 16} in the attachments.



ssrnet_tx2_batchsize_exp.zip (2.95 MB)

Sorry for popping into the discussion here. I am also facing a similar issue to the one Evan described, but with a different model. I hope you can help us with this issue too.

Link : https://devtalk.nvidia.com/default/topic/1064724/tensorrt/slow-inferencing-on-tensorrt-with-gaps-in-between-processes/

Thanks!

Hi,

Thanks for keeping us updated.

The batch size experiment is helpful.
When you build the TensorRT engine from the uff file, there is a parameter that specifies the maximal batch size.
May I know the value you set?

builder->setMaxBatchSize(maxBatchSize);
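For reference, the sketch below shows roughly where this call sits in a typical uff-to-engine build with the TensorRT 5.x C++ API. It is an illustration only: the file path and workspace size are placeholders, and the input/output names are taken from the trtexec command earlier in this thread.

#include <iostream>
#include "NvInfer.h"
#include "NvUffParser.h"

class Logger : public nvinfer1::ILogger
{
public:
    void log(Severity, const char* msg) override { std::cout << msg << std::endl; }
};

static Logger gLogger;

int main()
{
    using namespace nvinfer1;
    using namespace nvuffparser;

    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    IUffParser* parser = createUffParser();

    // Input/output names as used in the trtexec command earlier in the thread.
    parser->registerInput("input_1", Dims3(3, 64, 64), UffInputOrder::kNCHW);
    parser->registerOutput("ssr_function/mul_9");
    parser->parse("model.uff", *network, DataType::kFLOAT);   // placeholder path

    builder->setMaxBatchSize(16);              // <-- the value in question
    builder->setMaxWorkspaceSize(1 << 28);     // 256 MB, adjust as needed
    builder->setFp16Mode(true);                // FP16 as used in this thread

    ICudaEngine* engine = builder->buildCudaEngine(*network);
    if (!engine)
    {
        std::cerr << "Engine build failed" << std::endl;
        return 1;
    }

    // ... create an execution context and run inference with batch size <= 16 ...

    engine->destroy();
    parser->destroy();
    network->destroy();
    builder->destroy();
    return 0;
}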

By the way, we have also seen some TensorRT performance issues reported by other users.
Are you running any other CUDA work at the same time?

Or could you try to execute TensorRT with the --useSpinWait option for us?

./trtexec --uff=/path/to/resnet/model.uff --uffInput=input_img,3,224,224 --uffInput=input_yaw,1,1,1 --output=dream.add/add --useSpinWait

Thanks and please let us know the result.

Good Evening,

Here are the answers to your queries!

  1. What is the maximal batch size set? The value of maxBatchSize in the following line of code is 16:
    builder->setMaxBatchSize(maxBatchSize);
    
  2. Are there any other CUDA processes being run at the same time as the inference? No; other than the inference, we are not running any other CUDA processes.
  3. Output of TensorRT (trtexec) with the --useSpinWait option: We ran trtexec on the TX2 with the --useSpinWait option for both SSRNet (the model that exhibits the problem) and ResNet. The .txt output files have been attached to this post.

trtexec_resnetdream_useSpinWait.txt (18.7 KB)
trtexec_ssrnet_useSpinWait.txt (59.3 KB)

Hi,

We want to investigate this gap issue further and need to reproduce it in our environment.
Could you share the uff file with us?

Thanks.

Hi AastaLLL,

Attached is the uff file for SSRNet within the .zip
ssrnet_uff.zip (2.6 MB)

Hi,

It looks like the input/output names are different from those in the command in comment #7.
Could you share the input/output name of the shared uff file with us?

Thanks.

Hi AastaLLL,

My apologies.
Please use the following flags for input/output:

--uffInput=input_1,3,64,64 --output=ssr_function/mul_6

Hi,

Thanks for your help. We can reproduce this issue on our side.
It looks like the gap is generated by the cudaEventSynchronize function.
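For context, a trtexec-style timing loop looks roughly like the sketch below (simplified, not the exact trtexec source). The host blocks in cudaEventSynchronize until every enqueued kernel has finished, so the waiting shows up there in the profile.

#include <cuda_runtime_api.h>
#include "NvInfer.h"

// Simplified sketch of a trtexec-style timing loop: the kernels are enqueued
// asynchronously and the host blocks in cudaEventSynchronize until the last
// one finishes.
float timeOneRun(nvinfer1::IExecutionContext& context,
                 void** bindings, int batchSize, cudaStream_t stream)
{
    cudaEvent_t start, end;
    cudaEventCreate(&start);
    cudaEventCreate(&end);

    cudaEventRecord(start, stream);
    context.enqueue(batchSize, bindings, stream, nullptr); // asynchronous launch
    cudaEventRecord(end, stream);
    cudaEventSynchronize(end);                             // host waits here

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, end);

    cudaEventDestroy(start);
    cudaEventDestroy(end);
    return ms;
}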

We will check this issue with our internal team and share more information with you later.

Hi,

Thanks for your patience. We have now clarified this issue.

Actually, this is not an issue.

Please note that there are two steps to running TensorRT from a uff model:

  1. Compile the uff file into a TensorRT engine
  2. Run inference with the TensorRT engine

The gaps you observed between kernels occur during the compile step.
In this step, TensorRT evaluates each kernel’s runtime and chooses a fast one, so the large gaps between kernels are expected.

To profile the TensorRT inference time, please use the profiling data close to the end.
There, the maximum gap we see is only 0.05 ms.
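If you want the profile to contain only the inference step, one option is to build and serialize the engine once, then deserialize it in the application you profile, so the builder's kernel auto-tuning never appears in the timeline. Below is a sketch assuming the TensorRT 5.x C++ API; the plan file name and the Logger class are placeholders.

#include <fstream>
#include <iterator>
#include <vector>
#include <iostream>
#include "NvInfer.h"

class Logger : public nvinfer1::ILogger
{
public:
    void log(Severity, const char* msg) override { std::cout << msg << std::endl; }
};

static Logger gLogger;

// Step 1 (offline): after buildCudaEngine(), save the engine to disk.
void saveEngine(nvinfer1::ICudaEngine& engine, const char* path)
{
    nvinfer1::IHostMemory* blob = engine.serialize();
    std::ofstream out(path, std::ios::binary);
    out.write(static_cast<const char*>(blob->data()), blob->size());
    blob->destroy();
}

// Step 2 (profiled application): load the plan file and run inference only.
nvinfer1::ICudaEngine* loadEngine(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    // NOTE: in real code keep the runtime handle and destroy it after the engine.
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}

int main()
{
    nvinfer1::ICudaEngine* engine = loadEngine("ssrnet.plan"); // placeholder name
    if (!engine)
        return 1;

    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    // ... allocate bindings and profile context->enqueue()/execute() only ...
    context->destroy();
    engine->destroy();
    return 0;
}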

Thanks.