Hello, I am using TensorRT on a single GPU; my inference code is as follows:
…
self.engine = tr.utils.load_engine(G_LOGGER, plan_file)
self.context = self.engine.create_execution_context()
…
self.cuda_context.push()
stream = self.cuda.Stream()
# copy input host -> device
self.cuda.memcpy_htod_async(self.d_input, data, stream)
stream.synchronize()
# run inference
self.context.enqueue(self.batch_size, self.bindings, stream.handle, None)
stream.synchronize()
# copy output device -> host
self.cuda.memcpy_dtoh_async(self.output, self.d_output, stream)
stream.synchronize()
self.cuda_context.pop()
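The per-image times below were measured around this call sequence. For reference, a minimal timing sketch (pure Python; `run_inference` is a hypothetical wrapper around the copy/enqueue/copy sequence above, not part of my code):

```python
import time

def time_inference_ms(run_inference, data, n_warmup=5, n_iters=50):
    """Average wall-clock latency of run_inference(data) in milliseconds.

    run_inference is a hypothetical callable wrapping the
    memcpy_htod_async / enqueue / memcpy_dtoh_async sequence above;
    warm-up iterations are run first and excluded from the average.
    """
    for _ in range(n_warmup):
        run_inference(data)
    start = time.perf_counter()
    for _ in range(n_iters):
        run_inference(data)
    return (time.perf_counter() - start) * 1000.0 / n_iters
```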
…
When I run 1 instance (a separate process, not a thread), the inference time for one image is about 300 ms.
When I run 2 instances, it increases to about 650 ms.
When I run 4 instances, it increases to about 1200 ms.
Clearly, the inference time grows linearly with the number of instances.
Why does this happen, and how can I solve it?
Test environment:
TensorRT: 4.0
GPU: GTX 1080 Ti
OS: Ubuntu 16.04
CUDA: 9.0
cuDNN: 7.3.1