Multithreaded inference

I implemented an object detector class that contains the runtime context.
It runs well on a single thread (e.g. X fps),
but with multiple threads (e.g. N threads),
it runs at about X/N fps on each thread.
What should I check?

I have also read the best-practices section
of the TensorRT docs, but it is unclear to me.


Please note that a separate CUDA stream is required for each parallel inference.
You can find an example in our trtexec binary directly:

/usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/mnist/mnist.onnx --streams=4
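A minimal structural sketch of the per-thread pattern this implies (plain Python threads, not the TensorRT API; `make_context` and `infer` are hypothetical placeholders for creating an execution context plus its own CUDA stream, and for enqueueing work on that stream):

```python
# Sketch: each worker thread owns its own context and stream; nothing is shared.
import threading, queue

results = queue.Queue()

def make_context(worker_id):
    # placeholder for engine.create_execution_context() + a per-thread stream
    return {"worker": worker_id}

def infer(ctx, x):
    # placeholder for enqueueing inference on this context's own stream
    return x * 2

def worker(worker_id, inputs):
    ctx = make_context(worker_id)   # created inside the thread, never shared
    for x in inputs:
        results.put(infer(ctx, x))

threads = [threading.Thread(target=worker, args=(i, [i, i + 10]))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results.queue))  # -> [0, 2, 4, 6, 20, 22, 24, 26]
```

The key point is only the ownership structure: sharing one context or one stream across threads serializes the work.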

Also, please pay attention to the GPU workload of your detection model.
If it already reaches 99% utilization, multiple threads will have to wait in turn for the resource.


Thanks for your reply.
Is there anything I need to do with the execution context?

I generated my model using your command,
but the performance doesn't change.
When I tried it, the model loading time on the other threads seemed shorter than on the first thread. Does TensorRT use a cache?


Sorry for the late update.
Could you share the detailed performance you observed with us?

For example, we got 8.61235ms for stream=1, and 8.74668ms for stream=4 with ResNet50.onnx.

$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --streams=1
$ /usr/src/tensorrt/bin/trtexec --onnx=/usr/src/tensorrt/data/resnet50/ResNet50.onnx --streams=4

This indicates we can run inference on 4x the input concurrently, and the elapsed time per inference stays roughly the same.
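The arithmetic behind that claim, using the latencies quoted above (assuming latency stays flat as streams scale, as the ResNet50 numbers show):

```python
# Throughput estimate from the quoted per-inference latencies.
latency_1 = 8.61235e-3   # seconds per inference, --streams=1
latency_4 = 8.74668e-3   # seconds per inference, --streams=4

qps_1 = 1 / latency_1    # ~116 inferences/s with one stream
qps_4 = 4 / latency_4    # ~457 inferences/s with four concurrent streams

speedup = qps_4 / qps_1
print(round(speedup, 2))  # -> 3.94, i.e. close to the ideal 4x
```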


This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.