Inference time increases linearly when running two or more TensorRT instances on a single GPU

Hello, I am using TensorRT on a single GPU. The inference code is as follows:


# Deserialize the engine from the serialized plan file (TensorRT 4 legacy Python API)
self.engine = tr.utils.load_engine(G_LOGGER, plan_file)
self.context = self.engine.create_execution_context()

# Per inference call: push this process's CUDA context, then copy input to the
# device, run inference, and copy the output back to the host
self.cuda_context.push()
stream = self.cuda.Stream()
self.cuda.memcpy_htod_async(self.d_input, data, stream)
stream.synchronize()
self.context.enqueue(self.batch_size, self.bindings, stream.handle, None)
stream.synchronize()
self.cuda.memcpy_dtoh_async(self.output, self.d_output, stream)
stream.synchronize()
self.cuda_context.pop()

When I run 1 instance (a separate process, not a thread), the inference time for one image is about 300 ms.
When I run 2 instances, the inference time increases to about 650 ms.
When I run 4 instances, the inference time increases to about 1200 ms.

Clearly, the inference time grows roughly linearly with the number of instances.
Why does this happen, and how can I solve it?

Test environment:
TensorRT: 4.0
GPU: GTX 1080 Ti
OS: Ubuntu 16.04
CUDA: 9.0
cuDNN: 7.3.1

Hello,

It is not entirely clear from the code snippet you provided, but please verify that each process creates its own engine and execution context. This is essential for running inference in parallel; otherwise additional latency is expected because the processes are contending for the same GPU resources.
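For reference, a minimal per-process sketch is shown below. It is not taken from your code: names such as plan_file, data, d_input, d_output, output, bindings, and batch_size are assumed to be set up elsewhere in your application, and it uses the TensorRT 4 legacy Python API together with pycuda.autoinit so that each process owns its own CUDA context, engine, execution context, and stream.

# Sketch: one engine + execution context + stream per process.
# Assumptions: TensorRT 4 legacy Python API, pycuda installed,
# device buffers (d_input, d_output), host output buffer, bindings list,
# and batch_size are created elsewhere.
import tensorrt as trt
import pycuda.autoinit          # gives this process its own CUDA context
import pycuda.driver as cuda

G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.WARNING)

def run_process(plan_file, data, d_input, d_output, output, bindings, batch_size):
    # Each process deserializes its own engine and creates its own execution context
    engine = trt.utils.load_engine(G_LOGGER, plan_file)
    context = engine.create_execution_context()
    stream = cuda.Stream()

    # Async H2D copy, inference, D2H copy, all on this process's own stream
    cuda.memcpy_htod_async(d_input, data, stream)
    context.enqueue(batch_size, bindings, stream.handle, None)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    return output

Because pycuda.autoinit gives each process a dedicated context, the explicit push/pop from your snippet is not needed in this sketch, and a single synchronize at the end is enough for this sequence of async calls.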

Also, we recommend profiling GPU utilization with nvprof first (for example, nvprof python your_inference_script.py) to see whether the kernels from the different processes are actually overlapping or are being serialized on the GPU.
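If a full-process trace is too noisy, one option is to limit the capture to the inference region and run nvprof with --profile-from-start off. A rough sketch, assuming pycuda is used as in your snippet and that run_inference wraps the memcpy/enqueue calls shown above:

# Sketch: restrict nvprof capture to the inference region.
# Assumption: pycuda is already initialized in this process (e.g. via pycuda.autoinit)
# and run_inference(data) wraps the memcpy/enqueue calls from the snippet above.
import pycuda.driver as cuda

def profiled_inference(run_inference, data):
    cuda.start_profiler()      # nvprof begins recording here when launched with --profile-from-start off
    result = run_inference(data)
    cuda.stop_profiler()       # stop recording and flush profiler output
    return result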