TensorRT Batching Speed scales poorly

Description

Batching speed scales linearly with batch size. Theoretically, it should not scale linearly, since the per-instance operations in a batch are embarrassingly parallel.

Environment

TensorRT Version: 7.2.3
GPU Type: RTX 2080
Nvidia Driver Version: 460.91
CUDA Version: 11.1
CUDNN Version: 8.1.1
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.8.1
Baremetal or Container (if container which image + tag):

Relevant Files

Github Repo of YoloV5 TensorRT: tensorrtx/yolov5 at master · wang-xinyu/tensorrtx · GitHub

Run the .engine file and observe the batch timing. All batch testing is done over at least 20 iterations.

Under FP16:

Batch size 1: 1ms
Batch size 2: 2ms
Batch size 4: 4ms
Batch size 8: 8ms
Batch size 9: 9ms

Under FP32:

Batch size 1: 3ms
Batch size 2: 5ms
Batch size 4: 9ms
Batch size 8: 17-18ms
Batch size 9: 19-20ms

Under FP16, it seems strange to me that the batch time scales linearly. Put differently, throughput stays flat: batch size 1 gives 1 / 0.001 s = 1000 images/s, and batch size 8 still gives 8 / 0.008 s = 1000 images/s. Theoretically, batching should not scale linearly, because the instances can be processed in parallel, unless a single batch already saturates the GPU. The input image to the model is 480x480.

Edit: Follow the GitHub discussion here: [Discussion] TensorRT Batching Speed scales poorly (tested YoloV5 repo) · Issue #718 · wang-xinyu/tensorrtx · GitHub

Hi,

We recommend that you try the latest TensorRT version. When measuring model performance, make sure you consider only the latency and throughput of the network inference itself, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy

You can also run nvidia-smi dmon -s u to check GPU utilization (for different batch sizes), or use Nsight Systems to visualize the profiles: NVIDIA Nsight Systems | NVIDIA Developer
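
For example, something along these lines captures a kernel timeline for one inference run (a rough sketch; ./your_inference_binary is a placeholder for your own application, and the exact Nsight Systems flags and report file extension can differ between versions):

nsys profile -o batch_profile --trace=cuda,nvtx ./your_inference_binary <args>
nsys stats batch_profile.qdrep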

Thanks!


Hi @spolisetty,

Thank you for the prompt reply. The model performance (inference speed) is measured similarly to the code in the TensorRT docs (it only times the network inference):

#include <chrono>

auto startTime = std::chrono::high_resolution_clock::now();
context->enqueueV2(&buffers[0], stream, nullptr);
cudaStreamSynchronize(stream);
auto endTime = std::chrono::high_resolution_clock::now();
float totalTime = std::chrono::duration<float, std::milli>(endTime - startTime).count();

However, one difference in my code is that the batch size is passed explicitly, to allow running different batch sizes:

...(IExecutionContext& context, ...)
...
context.enqueue(batchSize, buffers, stream, nullptr);
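
For completeness, the reported numbers are averages over the iterations. Roughly how the average can be measured (a simplified sketch, not the exact repo code; engine/buffer/stream setup is omitted, and the warm-up runs are my own addition so that one-time initialization is not counted):

#include <chrono>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

// Average per-inference latency in milliseconds over `iterations` runs.
float averageLatencyMs(nvinfer1::IExecutionContext& context, void** buffers,
                       cudaStream_t stream, int batchSize, int iterations)
{
    // A few warm-up runs so that lazy initialization is not timed.
    for (int i = 0; i < 5; ++i)
        context.enqueue(batchSize, buffers, stream, nullptr);
    cudaStreamSynchronize(stream);

    auto startTime = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i)
        context.enqueue(batchSize, buffers, stream, nullptr);
    cudaStreamSynchronize(stream);
    auto endTime = std::chrono::high_resolution_clock::now();

    float totalTime = std::chrono::duration<float, std::milli>(endTime - startTime).count();
    return totalTime / iterations;
}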

From the docs (pretty much what I am thinking):

“In TensorRT, a batch is a collection of inputs that can all be processed uniformly. Each instance in the batch has the same shape and flows through the network in exactly the same way. Each instance can, therefore, be trivially computed in parallel.”

I guess there is a possibility that even though we pass the whole batch to a single inference call, the GPU does not have enough free compute to process the instances in parallel, so it effectively processes them one after another instead of all at once. I will attempt the profiling and get back to you soon; I am not very familiar with it.

Hi,

Could you please share a minimal, complete repro script/model so we can try it on our end for better debugging? Also, as mentioned previously, you can first check GPU utilization using the command shared. Based on that, we can try profiling.

Thank you.


Hi @spolisetty,

I tried the command you gave me, nvidia-smi dmon -s u, and ran the code that loads the engine and runs inference. I ran the inference a few times for every batch size.

1. Batch size 9 output: when I run the inference, SM goes to 15 and MEM goes to 8 (I ran it a few times, which is why you see those numbers appear several times):

# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0     1     0     0     0
    0     3     0     0     0
    0     0     0     0     0
    0    15     8     0     0
    0     1     0     0     0
    0     0     0     0     0
    0    15     8     0     0
    0     1     0     0     0
    0     0     0     0     0
    0     0     0     0     0
    0     4     0     0     0
    0    15     8     0     0
    0     2     0     0     0
    0    15     8     0     0
    0    15     8     0     0
    0     8     1     0     0
    0     2     0     0     0
    0    15     8     0     0
    0    16     8     0     0
    0     0     0     0     0
    0    15     8     0     0
    0     3     0     0     0
    0    15     8     0     0
    0     0     0     0     0
    0     5     0     0     0

2. Batch size 8 output (when I run this one, it gives 12 SM / 6 MEM, but sometimes 13-14 SM / 7 MEM):
# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0     5     0     0     0
    0     1     0     0     0
    0     4     0     0     0
    0    12     6     0     0
    0     2     1     0     0
    0     0     0     0     0
    0     0     0     0     0
    0     1     0     0     0
    0     4     1     0     0
    0     1     0     0     0
    0     4     2     0     0
    0    11     6     0     0
    0     0     0     0     0
    0     0     0     0     0
    0     1     0     0     0
    0     4     1     0     0
    0    13     7     0     0
    0    14     7     0     0
    0    14     7     0     0
    0    13     7     0     0
    0    14     7     0     0
    0     1     0     0     0
    0     2     0     0     0
    0     0     0     0     0
    0     1     0     0     0
    0    12     6     0     0
    0     7     4     0     0
    0     1     0     0     0
    0     0     0     0     0
    0     6     0     0     0
    0    11     1     0     0
    0    14     7     0     0
    0    13     7     0     0
    0    14     7     0     0
    0     3     0     0     0

3. Batch size 4 output (the numbers fluctuate):
# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0     2     0     0     0
    0     2     0     0     0
    0     1     0     0     0
    0    12     5     0     0
    0    13     6     0     0
    0     8     3     0     0
    0     9     4     0     0
    0    19     7     0     0
    0     0     0     0     0
    0     2     0     0     0
    0    11     4     0     0
    0    14     6     0     0
    0     8     3     0     0
    0     8     3     0     0
    0    12     5     0     0
    0     1     0     0     0
    0     0     0     0     0
    0     0     0     0     0
    0     0     0     0     0
    0     2     0     0     0
    0     9     4     0     0
    0    15     7     0     0
    0     8     3     0     0
    0     9     4     0     0
    0    14     7     0     0
    0     0     0     0     0
    0     1     0     0     0
    0     4     0     0     0
    0     9     4     0     0
    0    10     4     0     0
    0    12     5     0     0
    0    10     4     0     0
    0     9     4     0     0

4. Lastly, batch size 1 output:
# gpu    sm   mem   enc   dec
# Idx     %     %     %     %
    0     2     0     0     0
    0     0     0     0     0
    0     5     0     0     0
    0    16     4     0     0
    0    16     4     0     0
    0    15     4     0     0
    0    15     4     0     0
    0     0     0     0     0
    0     1     0     0     0
    0     0     0     0     0
    0     6     1     0     0
    0    15     4     0     0
    0    15     4     0     0
    0    15     4     0     0
    0    15     4     0     0
    0     5     1     0     0
    0     5     0     0     0
    0    15     4     0     0
    0    10     3     0     0
    0    15     4     0     0
    0    15     4     0     0
    0    15     4     0     0
    0     6     1     0     0
    0     0     0     0     0
    0    15     4     0     0
    0    16     4     0     0
    0    15     4     0     0
    0    16     4     0     0
    0    16     4     0     0
    0    12     3     0     0
    0     6     1     0     0

I am not sure whether these outputs are of any use for drawing a conclusion. Is SM % the percentage of SMs in use, and MEM % the percentage of GPU memory?

Regarding Nsight Systems profiling, I cannot seem to get it to work; I get a lot of errors, and the documentation does not really explain how to use it.

Minimal issue repro:

  1. Go to tensorrtx/yolov5 at master · wang-xinyu/tensorrtx · GitHub and follow the README (it takes only about 5 minutes to set up).

  2. Once you have the TensorRT engine file, run sudo ./yolov5 -d yolov5.engine ../samples to get the inference speed; I also used this run for the GPU utilisation check.

  3. Store around 100 images in the samples folder so that the model inference can be done multiple times.

You can change the batch size in yolov5.cpp via the global variable BATCH_SIZE. Change it to 4, 8, or 9 and see whether your inference speed scales linearly. My observation is that under FP16 the inference speed scales linearly with batch size.
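
For reference, after each BATCH_SIZE change I rebuild and regenerate the engine before timing again, roughly as follows (a sketch; the -s/-d options are the serialize/deserialize modes described in the repo's README, and the engine file name may differ in your setup):

make
sudo ./yolov5 -s                              # serialize a new engine with the updated BATCH_SIZE
sudo ./yolov5 -d yolov5.engine ../samples     # deserialize and time inference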

Hi @timlim_ai,

Yes, SM % indicates GPU utilization. It looks like GPU utilization stays roughly the same as the batch size increases, which is why the inference time grows with batch size. If possible, could you please share the ONNX model and a minimal repro inference script with us? Also, have you observed the same behavior when you tried trtexec?
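
For example, something like the following benchmarks the serialized engine directly (a sketch; the --batch flag applies because the tensorrtx engine uses an implicit batch dimension, and if the engine contains the repo's custom YOLO layer, the plugin library needs to be loaded with --plugins):

trtexec --loadEngine=yolov5.engine --batch=8 --iterations=100 --avgRuns=20 --plugins=<path to the repo's plugin .so>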

Thank you.

Hi @spolisetty. The repository converts the PyTorch saved weights (.pth file to .wts file) and then builds the TensorRT engine through the TensorRT API calls rather than any of the parsers. I am not too sure whether this could be the cause.

I will try it out on a simple network (LeNet) first. I am not very familiar with TensorRT and the tutorials feel insufficient, so I will take some time to get back to you regarding the ONNX model. As a side question, what is currently the best way to convert PyTorch saved weights and the model structure to TensorRT?