Inference time increases linearly when running two or more TensorRT instances on a single GPU

Hello, I am using TensorRT on a single GPU. The inference code is as follows:


# Deserialize the engine from the serialized plan file (TensorRT 4 legacy Python API)
self.engine = tr.utils.load_engine(G_LOGGER, plan_file)
self.context = self.engine.create_execution_context()

# Per inference call: push this process's CUDA context, then copy input to the
# device, run inference, and copy the output back to the host
self.cuda_context.push()
stream = self.cuda.Stream()
self.cuda.memcpy_htod_async(self.d_input, data, stream)
stream.synchronize()
self.context.enqueue(self.batch_size, self.bindings, stream.handle, None)
stream.synchronize()
self.cuda.memcpy_dtoh_async(self.output, self.d_output, stream)
stream.synchronize()
self.cuda_context.pop()

When I run 1 instance (a separate process, not a thread), the inference time for one image is about 300 ms.
When I run 2 instances, the inference time increases to about 650 ms.
When I run 4 instances, the inference time increases to about 1200 ms.

Clearly, the inference time grows roughly linearly with the number of instances.
Why does this happen, and how can I solve it?

Test environment:
TensorRT: 4.0
GPU: GTX 1080 Ti
OS: Ubuntu 16.04
CUDA: 9.0
cuDNN: 7.3.1

Hello,

It is not entirely clear from the code snippet you provided, but please verify that each process creates its own engine and execution context. This is essential for running inference in parallel; otherwise additional latency is expected because the processes are contending for the same GPU resources.
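For reference, a minimal per-process sketch is shown below. It is not taken from your code: names such as plan_file, data, d_input, d_output, output, bindings, and batch_size are assumed to be set up elsewhere in your application, and it uses the TensorRT 4 legacy Python API together with pycuda.autoinit so that each process owns its own CUDA context, engine, execution context, and stream.

# Sketch: one engine + execution context + stream per process.
# Assumptions: TensorRT 4 legacy Python API, pycuda installed,
# device buffers (d_input, d_output), host output buffer, bindings list,
# and batch_size are created elsewhere.
import tensorrt as trt
import pycuda.autoinit          # gives this process its own CUDA context
import pycuda.driver as cuda

G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.WARNING)

def run_process(plan_file, data, d_input, d_output, output, bindings, batch_size):
    # Each process deserializes its own engine and creates its own execution context
    engine = trt.utils.load_engine(G_LOGGER, plan_file)
    context = engine.create_execution_context()
    stream = cuda.Stream()

    # Async H2D copy, inference, D2H copy, all on this process's own stream
    cuda.memcpy_htod_async(d_input, data, stream)
    context.enqueue(batch_size, bindings, stream.handle, None)
    cuda.memcpy_dtoh_async(output, d_output, stream)
    stream.synchronize()
    return output

Because pycuda.autoinit gives each process a dedicated context, the explicit push/pop from your snippet is not needed in this sketch, and a single synchronize at the end is enough for this sequence of async calls.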

Also, we recommend profiling GPU utilization with nvprof first (for example, nvprof python your_inference_script.py) to see whether the kernels from the different processes are actually overlapping or are being serialized on the GPU.
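If a full-process trace is too noisy, one option is to limit the capture to the inference region and run nvprof with --profile-from-start off. A rough sketch, assuming pycuda is used as in your snippet and that run_inference wraps the memcpy/enqueue calls shown above:

# Sketch: restrict nvprof capture to the inference region.
# Assumption: pycuda is already initialized in this process (e.g. via pycuda.autoinit)
# and run_inference(data) wraps the memcpy/enqueue calls from the snippet above.
import pycuda.driver as cuda

def profiled_inference(run_inference, data):
    cuda.start_profiler()      # nvprof begins recording here when launched with --profile-from-start off
    result = run_inference(data)
    cuda.stop_profiler()       # stop recording and flush profiler output
    return result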