How to increase TensorRT GPU utilization for lots of requests?


I am using the TensorRT C++ API to deploy a model inference service. The problem is that under multi-threaded load with lots of requests, GPU utilization stays at around 70-80%.

My implementation creates one engine with a single execution context and uses a std::mutex to protect the inference function call. The inference function looks like this:

// The vector's element type was lost in the post; float is assumed here.
int inference(std::vector<float>& input_data) {
    std::lock_guard<std::mutex> guard(mMtx);
    // copying data from host to device
    // copying data from device to host
    // ...
}

The locked section includes the memory copy steps, which stall GPU execution.

Currently I use a workaround that creates multiple processes, each with its own TRT engine and context. Spreading requests across the processes hides the CPU latency and increases GPU utilization.

But I am not sure whether this is good practice.

I have also tried two other things:

  1. Create multiple engines and contexts in a single process. This did not work: the process crashed, and the debug message seemed to be related to optimization profiles.

  2. Build the engine with multiple optimization profiles and create multiple execution contexts at runtime. This utilizes the GPU better when the number of requests is large. But with, for example, only one request, the GPU latency seems to be higher than with the single-context scheme.
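
For reference, the dispatch side of scheme 2 could be sketched roughly as below. This is only a sketch under assumptions, not TensorRT code: `ContextSlot`, `ContextPool`, `SlotLease`, `Stats` and `run_demo` are hypothetical names, each slot stands in for one execution context with its own stream and buffers, and the sleep stands in for the copies and the enqueue call. The point it illustrates is that the lock only guards the free list, so up to N requests can be in flight at once.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for one execution context plus its own CUDA stream
// and device buffers. A real slot would own those objects.
struct ContextSlot {
    int id = 0;
    int served = 0;  // requests handled by this slot
};

// Fixed-size pool: a request borrows a free slot, does its work without
// holding the pool lock, then hands the slot back.
class ContextPool {
public:
    explicit ContextPool(std::size_t n) : slots_(n) {
        for (std::size_t i = 0; i < n; ++i) {
            slots_[i].id = static_cast<int>(i);
            free_.push_back(&slots_[i]);
        }
    }
    ContextSlot* acquire() {
        std::unique_lock<std::mutex> lock(mtx_);
        cv_.wait(lock, [this] { return !free_.empty(); });  // block until a slot is free
        ContextSlot* slot = free_.back();
        free_.pop_back();
        return slot;
    }
    void release(ContextSlot* slot) {
        assert(slot != nullptr);
        {
            std::lock_guard<std::mutex> lock(mtx_);
            free_.push_back(slot);
        }
        cv_.notify_one();
    }
private:
    std::vector<ContextSlot> slots_;   // never resized, so pointers stay valid
    std::vector<ContextSlot*> free_;
    std::mutex mtx_;
    std::condition_variable cv_;
};

// RAII lease so a slot is always returned, even on early exit.
class SlotLease {
public:
    explicit SlotLease(ContextPool& pool) : pool_(pool), slot_(pool.acquire()) {}
    ~SlotLease() { pool_.release(slot_); }
    SlotLease(const SlotLease&) = delete;
    SlotLease& operator=(const SlotLease&) = delete;
    ContextSlot& slot() { return *slot_; }
private:
    ContextPool& pool_;
    ContextSlot* slot_;
};

struct Stats {
    std::atomic<int> in_flight{0};
    std::atomic<int> peak{0};
    std::atomic<int> done{0};
};

int inference(ContextPool& pool, Stats& stats, const std::vector<float>& input) {
    (void)input;  // unused in this simulation
    SlotLease lease(pool);
    const int now = ++stats.in_flight;
    int prev = stats.peak.load();
    while (prev < now && !stats.peak.compare_exchange_weak(prev, now)) {}
    // Placeholder for the real work on this slot's stream:
    // host-to-device copy, enqueue, device-to-host copy, synchronize.
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    ++lease.slot().served;
    --stats.in_flight;
    ++stats.done;
    return lease.slot().id;
}

struct DemoResult { int done; int peak; };

// Runs `threads` workers that each issue `requests_per_thread` requests.
DemoResult run_demo(std::size_t contexts, int threads, int requests_per_thread) {
    ContextPool pool(contexts);
    Stats stats;
    std::vector<float> input(16, 0.0f);
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&] {
            for (int r = 0; r < requests_per_thread; ++r) inference(pool, stats, input);
        });
    }
    for (auto& w : workers) w.join();
    return {stats.done.load(), stats.peak.load()};
}
```

Calling `run_demo(2, 8, 5)` would push 40 requests through two slots, with at most two running concurrently. With a pool of size 1 this degenerates to the single-mutex scheme above, so the pool size is the knob that trades GPU utilization under load against per-request resources.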

What is the best practice for this scenario? Thanks!


TensorRT Version: TRT 7.0
GPU Type: Tesla T4
Nvidia Driver Version: 418.116.00
CUDA Version: cuda 10.2
CUDNN Version: cudnn 7.6.5
Operating System + Version: Debian 9

Hi, please share the model, script, profiler output, and performance output so that we can help you better.

Alternatively, you can try running your model with the trtexec command, or see the tips for optimizing performance.


Hi @lihui.sun,

Please try the following, which may help:

// The vector's element type was lost in the post; float is assumed here.
int inference(std::vector<float>& input_data) {
    std::lock_guard<std::mutex> guard(mMtx);
    // copying data from host to device
    mContext->enqueueV2(bindings, stream, nullptr);
    // copying data from device to host
    // ...
}

Thank you.

Thank you, I will have a try and update results.