Low compute utilization of converted TensorFlow model during inference

Provide details on the platforms you are using:
Target Device:
NVIDIA Jetson TX2 Module (P3310)
Ubuntu 18.04.2 LTS
4.9.140-tegra

Development Platform:
Linux distro and version: Ubuntu 16.04
GPU: Nvidia RTX2070
nvidia driver 418.88
CUDA 10.0
CUDNN 7
Python 3.5
Tensorflow-gpu 1.14.0
TensorRT 5.1.5.0
UFF Version 0.6.3

Hi all,

We have ported a TensorFlow .pb model to TensorRT and have run it on a Jetson TX2 (P3310).
However, inference is slow despite the model being relatively small. Model Architecture Here

In a unit test (without profiling) in which our model performs inference on a single 64x64 image on a single stream, the model takes ~3ms.
In comparison, a ResNet18 model takes ~5ms for inference on a single 224x224 image.

We expected the inference on our model to be much faster for the following reasons:

  • Both models are converted to FP16 and have pooling layers between convolutional layers
  • Our model is also much smaller, having only 6 convolutional layers, whereas ResNet18 has at least 17 convolutional layers
  • The input size for our model is 64x64, which is 12.25 times smaller than the 224x224 input of the ResNet18 model.

Despite the size advantages that our model has, it only takes around half the time of a ResNet18 inference (~3ms vs ~5ms), far short of the speedup we expected.

We tried to identify the issue in our model using the NVIDIA Visual Profiler, importing .prof files generated with the following command:

nvprof -o timeline.prof ./model/unit_test

In the visual profiler, the timeline shows that the graph computations for our model occur much more sparsely than those of a ResNet18 model. Click for comparison and individual timelines

We think that the idle time between computations is the main culprit for the increased inference time of our model. However, we’re not sure what could be causing the sparse computations, and any help would be greatly appreciated!

Best Regards,
tkchong




Hello,

While this thread was moved to the TX2 forums, we’d like to point out that the computations still occur sparsely when the model is run on the host, although not as severely.

Here is our comparison between the profiler timelines for the same model performing inference on a single image but on different devices.
The images of the profiler timelines have also been attached to this reply.

Thank you for your time.


Hi,

Thanks for your report. We need some information to give further suggestions.

1. Have you maximized the device performance first?

sudo nvpmodel -m 0
sudo jetson_clocks

2. We assume you are using TF-TRT rather than pure TensorRT. Is that correct?

3. Could you help check whether all the layers inside your model are natively supported by TensorRT?
This information can be found in the log messages, for example with a logger like the sketch below.
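For example, a minimal logger along the lines of the sketch below (an illustration only, assuming the TensorRT 5.x C++ API; the class name ConsoleLogger is a placeholder) prints every message TensorRT emits while parsing and building, so any warnings about unsupported layers show up on the console.

#include <iostream>
#include "NvInfer.h"

// Forward every TensorRT log message to stdout so that warnings about
// unsupported or converted layers are visible during engine creation.
class ConsoleLogger : public nvinfer1::ILogger
{
public:
    void log(Severity, const char* msg) override
    {
        // Print everything, including INFO-level messages, so nothing is filtered out.
        std::cout << msg << std::endl;
    }
};

static ConsoleLogger gLogger;

int main()
{
    // All builder/parser/runtime messages now go through gLogger.
    nvinfer1::IBuilder* builder = nvinfer1::createInferBuilder(gLogger);
    // ... register inputs/outputs, parse the uff file, build the engine ...
    builder->destroy();
    return 0;
}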

Thanks.

Hi,

Thanks for your reply. Here is the information you requested:

  1. We conducted our experimentation with sudo jetson_clocks but not with sudo nvpmodel -m 0
    Here is the timeline from applying both commands prior to profiling (also available as an attachment). It seems the issue still persists despite maximizing device performance.

  2. We are using pure TensorRT.

  3. All layers were converted from the .uff to TensorRT FP16 without any issues. Our unit tests also show that the output of the model in TensorRT is correct.

We also conducted additional testing in which we increased our kernel sizes twofold. The computations for each kernel take longer (which makes sense), decreasing the gaps between computations. However, what is peculiar is that the total inference time remains about the same, which perhaps implies that the bottleneck is kernel launch latency. What we’d really want is for the kernels to execute back to back.


Hi,

It looks weird to me. The layers in your model should run very well with pure TensorRT.

Could you help us do another experiment?
We would like to get a layer-level profiling result from TensorRT.

Here are the steps:

$ cp -r /usr/src/tensorrt/ .
$ cd tensorrt/samples/trtexec

Apply this patch

diff --git a/trtexec.cpp b/trtexec.cpp
index 95d01fb..449e8f4 100644
--- a/trtexec.cpp
+++ b/trtexec.cpp
@@ -143,6 +143,34 @@ float percentile(float percentage, std::vector<float>& times)
     return std::numeric_limits<float>::infinity();
 }
 
+struct Profiler : public IProfiler
+{
+    typedef std::pair<std::string, float> Record;
+    std::vector<Record> mProfile;
+
+    virtual void reportLayerTime(const char* layerName, float ms)
+    {
+        auto record = std::find_if(mProfile.begin(), mProfile.end(), [&](const Record& r){ return r.first == layerName; });
+        if (record == mProfile.end())
+            mProfile.push_back(std::make_pair(layerName, ms));
+        else
+            record->second += ms;
+    }
+
+    void printLayerTimes()
+    {
+        float totalTime = 0;
+        for (size_t i = 0; i < mProfile.size(); i++)
+        {
+            printf("%-40.40s %4.3fms\n", mProfile[i].first.c_str(), mProfile[i].second / gParams.avgRuns);
+            totalTime += mProfile[i].second;
+        }
+        printf("Time over all layers: %4.3f\n", totalTime / gParams.avgRuns);
+    }
+
+} gProfiler;
+
+
 class RndInt8Calibrator : public IInt8EntropyCalibrator2
 {
 public:
@@ -405,6 +433,7 @@ ICudaEngine* onnxToTRTModel()
 void doInference(ICudaEngine& engine)
 {
     IExecutionContext* context = engine.createExecutionContext();
+    context->setProfiler(&gProfiler);
 
     // Use an aliasing shared_ptr since we don't want engine to be deleted when bufferManager goes out of scope.
     std::shared_ptr<ICudaEngine> emptyPtr{};
@@ -438,10 +467,12 @@ void doInference(ICudaEngine& engine)
             times[i] = ms;
             totalGpu += ms;
         }
+        gProfiler.printLayerTimes();
         totalGpu /= gParams.avgRuns;
         totalHost /= gParams.avgRuns;
         gLogInfo << "Average over " << gParams.avgRuns << " runs is " << totalGpu << " ms (host walltime is " << totalHost
                  << " ms, " << static_cast<int>(gParams.pct) << "\% percentile time is " << percentile(gParams.pct, times) << ")." << std::endl;
+
     }
 
     if (gParams.dumpOutput)
$ make
$ cd ../../bin
$ ./trtexec xxxx

Thanks.

Hello, thanks for the reply!

We have run trtexec with our model on both our host system and the TX2.
We ran the executable with our .uff model file and minimal flags, as follows:

./trtexec --uff=/path/to/model.uff --uffInput=input_1,3,64,64 --output=ssr_function/mul_9

Here, input_1 and ssr_function/mul_9 correspond to the input and output tensors of the model.

The results were piped to .txt files and have been attached as the following:

  1. Output on Host: trtexec_ssrnet_host.txt
  2. Output on TX2: trtexec_ssrnet_tx2.txt

We have also provided our model as a .pbtxt file (uploaded as a .txt file because of forum attachment limitations) without loaded weights, for your reference.
trtexec_ssrnet_host.txt (59.2 KB)
trtexec_ssrnet_tx2.txt (59.3 KB)
ssrnet_nchw(pbtxt).txt (150 KB)

Hello, for additional reference, we are attaching the output from running trtexec with a ResNet on both our host system and the TX2.

  1. Output on Host: trtexec_resnetdream_host.txt
  2. Output on TX2: trtexec_resnetdream_TX2.txt

The input command is as follows:

./trtexec --uff=/path/to/resnet/model.uff --uffInput=input_img,3,224,224 --uffInput=input_yaw,1,1,1 --output=dream.add/add

trtexec_resnetdream_host.txt (18.6 KB)
trtexec_resnetdream_tx2.txt (18.7 KB)

Hi,

Thanks for your experiment.
It looks like the issue comes from the activation_N/Tanh layer.

For the other operations, the TX2 runs around 2x slower than the host, but it is 7x slower on the Tanh operation.
We are checking this issue with our internal team and will share more information with you later.

Thanks.

Hi @AastaLLL,

Thanks for replying to us.

The activation_N/Tanh layers certainly are contributing to the inference time.
However, we don’t think they can fully account for the gaps in kernel activity that we mentioned in our original post, which trtexec does not show.
(post quoted below)

If the activation_N/Tanh layers were indeed the sole culprit, we believe there would be a proportional increase in inference time when the kernel size is increased. However, in one of the experiments that we shared here, an increase in kernel size did not yield a proportional increase in inference time.
(post quoted below)

This perhaps indicates there is another factor contributing to the gaps between computations.

Thanks for taking the time to work on this issue.
If you need any additional information or data, do let us know! :-)

Hi @AastaLLL,

We did a bit of further experimentation, this time varying the inference batch size when running on the TX2.
While some of the gaps between kernels are filled up, it seems there’s still a significant amount of inactivity between kernels.
Also note that the inference times for our model for batch sizes 1, 2, 4, 8 and 16 remain around 8ms.

Here’s a comparison between the timelines of the model with batchsize=1 and batchsize=16.
We’ve also provided the .prof files for the model for batchsize={1, 2, 4, 8, 16} in the attachments.



ssrnet_tx2_batchsize_exp.zip (2.95 MB)

Sorry for popping into the discussion here. I am also facing a similar issue to the one Evan described, but with a different model. I hope you can help us with this issue too.

Link : https://devtalk.nvidia.com/default/topic/1064724/tensorrt/slow-inferencing-on-tensorrt-with-gaps-in-between-processes/

Thanks!

Hi,

Thanks for keeping us updated.

The batch size experiment is helpful.
When you build the TensorRT engine from the uff file, there is a parameter that specifies the maximal batch size.
May I know the value you set?

builder->setMaxBatchSize(maxBatchSize);
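For reference, the sketch below shows roughly where this call sits in a typical uff-to-engine build with the TensorRT 5.x C++ API. It is an illustration only: the file path and workspace size are placeholders, and the input/output names are taken from the trtexec command earlier in this thread.

#include <iostream>
#include "NvInfer.h"
#include "NvUffParser.h"

class Logger : public nvinfer1::ILogger
{
public:
    void log(Severity, const char* msg) override { std::cout << msg << std::endl; }
};

static Logger gLogger;

int main()
{
    using namespace nvinfer1;
    using namespace nvuffparser;

    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    IUffParser* parser = createUffParser();

    // Input/output names as used in the trtexec command earlier in the thread.
    parser->registerInput("input_1", Dims3(3, 64, 64), UffInputOrder::kNCHW);
    parser->registerOutput("ssr_function/mul_9");
    parser->parse("model.uff", *network, DataType::kFLOAT);   // placeholder path

    builder->setMaxBatchSize(16);              // <-- the value in question
    builder->setMaxWorkspaceSize(1 << 28);     // 256 MB, adjust as needed
    builder->setFp16Mode(true);                // FP16 as used in this thread

    ICudaEngine* engine = builder->buildCudaEngine(*network);
    if (!engine)
    {
        std::cerr << "Engine build failed" << std::endl;
        return 1;
    }

    // ... create an execution context and run inference with batch size <= 16 ...

    engine->destroy();
    parser->destroy();
    network->destroy();
    builder->destroy();
    return 0;
}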

By the way, we have also seen some TensorRT performance issues reported by other users.
Are you running any other CUDA work at the same time?

Or could you try to execute TensorRT with the --useSpinWait option for us?

./trtexec --uff=/path/to/resnet/model.uff --uffInput=input_img,3,224,224 --uffInput=input_yaw,1,1,1 --output=dream.add/add --useSpinWait

Thanks and please let us know the result.

Good Evening,

Here are the answers to your queries!

  1. What is the maximal batch size set? The value of maxBatchSize in the following line of code is 16:
    builder->setMaxBatchSize(maxBatchSize);
    
  2. Are there any other CUDA processes being run at the same time as the inference? No; other than the inference, we are not running any other CUDA processes.
  3. Output of TensorRT (trtexec) with the --useSpinWait option: We ran trtexec on the TX2 with the --useSpinWait option for both SSRNet (the model that exhibits the problem) and ResNet. The .txt output files have been attached to this post.

trtexec_resnetdream_useSpinWait.txt (18.7 KB)
trtexec_ssrnet_useSpinWait.txt (59.3 KB)

Hi,

We want to investigate this gap issue further and need to reproduce it in our environment.
Could you share the uff file with us?

Thanks.

Hi AastaLLL,

Attached is the uff file for SSRNet within the .zip
ssrnet_uff.zip (2.6 MB)

Hi,

It looks like the input/output names are different from those in the command in comment #7.
Could you share the input/output name of the shared uff file with us?

Thanks.

Hi AastaLLL,

My apologies.
Please use the following flags for input/output:

--uffInput=input_1,3,64,64 --output=ssr_function/mul_6

Hi,

Thanks for your help. We can reproduce this issue on our side.
It looks like the gap is generated by the cudaEventSynchronize function.
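For context, a trtexec-style timing loop looks roughly like the sketch below (simplified, not the exact trtexec source). The host blocks in cudaEventSynchronize until every enqueued kernel has finished, so the waiting shows up there in the profile.

#include <cuda_runtime_api.h>
#include "NvInfer.h"

// Simplified sketch of a trtexec-style timing loop: the kernels are enqueued
// asynchronously and the host blocks in cudaEventSynchronize until the last
// one finishes.
float timeOneRun(nvinfer1::IExecutionContext& context,
                 void** bindings, int batchSize, cudaStream_t stream)
{
    cudaEvent_t start, end;
    cudaEventCreate(&start);
    cudaEventCreate(&end);

    cudaEventRecord(start, stream);
    context.enqueue(batchSize, bindings, stream, nullptr); // asynchronous launch
    cudaEventRecord(end, stream);
    cudaEventSynchronize(end);                             // host waits here

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, end);

    cudaEventDestroy(start);
    cudaEventDestroy(end);
    return ms;
}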

We will check this issue with our internal team and share more information with you later.

Hi,

Thanks for your patience. We have now clarified this issue.

Actually, this is not an issue.

Please note that there are two steps to running TensorRT from a uff model:

  1. Compile the uff file into a TensorRT engine
  2. Run inference with the TensorRT engine

The gaps you observed between kernels occur during the compile step.
In this step, TensorRT evaluates each kernel’s runtime and chooses a fast one, so the large gaps between kernels are expected.

To profile the TensorRT inference time, please use the profiling data close to the end.
There, the maximum gap we see is only 0.05 ms.
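If you want the profile to contain only the inference step, one option is to build and serialize the engine once, then deserialize it in the application you profile, so the builder's kernel auto-tuning never appears in the timeline. Below is a sketch assuming the TensorRT 5.x C++ API; the plan file name and the Logger class are placeholders.

#include <fstream>
#include <iterator>
#include <vector>
#include <iostream>
#include "NvInfer.h"

class Logger : public nvinfer1::ILogger
{
public:
    void log(Severity, const char* msg) override { std::cout << msg << std::endl; }
};

static Logger gLogger;

// Step 1 (offline): after buildCudaEngine(), save the engine to disk.
void saveEngine(nvinfer1::ICudaEngine& engine, const char* path)
{
    nvinfer1::IHostMemory* blob = engine.serialize();
    std::ofstream out(path, std::ios::binary);
    out.write(static_cast<const char*>(blob->data()), blob->size());
    blob->destroy();
}

// Step 2 (profiled application): load the plan file and run inference only.
nvinfer1::ICudaEngine* loadEngine(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());

    nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger);
    // NOTE: in real code keep the runtime handle and destroy it after the engine.
    return runtime->deserializeCudaEngine(blob.data(), blob.size(), nullptr);
}

int main()
{
    nvinfer1::ICudaEngine* engine = loadEngine("ssrnet.plan"); // placeholder name
    if (!engine)
        return 1;

    nvinfer1::IExecutionContext* context = engine->createExecutionContext();
    // ... allocate bindings and profile context->enqueue()/execute() only ...
    context->destroy();
    engine->destroy();
    return 0;
}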

Thanks.