TensorRT: parallel inference when most input shapes fall within the same optimization profile

I use TensorRT for multi-threaded online serving. My model is built with dynamic shape support. The optimization profiles look like this: min_dim = [2^k + 1, other_shape], opt_dim = [2^k, other_shape], max_dim = [2^k, other_shape], where k is the profile index in [0, N) and other_shape is a fixed integer.

After reading the TensorRT docs on dynamic shape engines, I found that:
(1) IExecutionContext is not thread-safe.
(2) Different IExecutionContext objects created from the same ICudaEngine cannot share the same optimization profile index; each context must use a different profile index.
(3) It is recommended that each IExecutionContext object use its own CUDA stream.

In conclusion, when I use an engine with dynamic shapes, I should:
(1) create N IExecutionContext objects;
(2) set the optimization profile index to 0, 1, …, N-1, one per IExecutionContext object;
(3) create N CUDA streams and assign one to each IExecutionContext object.
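For concreteness, the per-request profile selection implied by this setup could be sketched like this. This is a pure-Python sketch with illustrative names, assuming profile k is tuned for batch sizes in (2^(k-1), 2^k] under the power-of-two scheme above:

```python
import math

def select_profile_index(batch_size: int, num_profiles: int) -> int:
    """Pick the optimization profile whose opt/max batch (2^k) covers the request.

    Assumes profile k is tuned for batch sizes in (2^(k-1), 2^k],
    so the index is ceil(log2(batch_size)), clamped at 0 for batch_size == 1.
    """
    if batch_size < 1:
        raise ValueError("batch size must be positive")
    k = max(0, math.ceil(math.log2(batch_size)))
    if k >= num_profiles:
        raise ValueError("batch size exceeds the largest profile")
    return k
```

With this mapping, every request whose batch size lands in the same power-of-two bucket is routed to the same profile index, and hence to the same IExecutionContext, which is exactly the serialization problem described below.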

My question is: if most of the requested input shapes fall within the same optimization profile, all of those requests are handled by a single IExecutionContext object on a single CUDA stream, so they can only be processed one by one rather than in parallel. How can I solve this if I want more parallelism? Is there any best practice?

Hi,
Can you try running your model with the trtexec command, and share the "--verbose" log in case the issue persists?
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec

You can refer to the link below for the list of supported operators; if any operator is not supported, you need to create a custom plugin for that operation.

Also, please share your model and script if you have not already, so that we can help you better.

Thanks!

@NVES I am sorry, but the docs you recommended do not solve my problem. Could you please take a closer look?

Hi @sneaxiy,

Your understanding of execution contexts is correct. For now, we suggest you create multiple profiles with the same shape range, or use multiple engine instances.
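If I understand the first suggestion correctly (several profiles covering the same shape range), incoming requests could then be spread round-robin across the contexts that own those duplicated profiles, so one hot shape range no longer serializes on a single context. A minimal thread-safe sketch with illustrative names:

```python
import itertools
import threading

class ProfileRoundRobin:
    """Rotate over the context/profile indices that share one shape range.

    Each index stands for an IExecutionContext whose assigned profile
    covers the same [min, max] shape range as the others.
    """

    def __init__(self, context_ids):
        self._cycle = itertools.cycle(context_ids)
        # itertools.cycle is not thread-safe on its own, so guard next().
        self._lock = threading.Lock()

    def next_context(self):
        with self._lock:
            return next(self._cycle)
```

A serving thread would call next_context() per request and submit work on that context's own CUDA stream.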

Thank you.

Thanks! As you suggested, I can create multiple engine instances. Is there any way to share the weights between these different ICudaEngine objects?

Hi @sneaxiy,

You can create a single engine with multiple contexts. The weights will be shared among the contexts.
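A common way to serve one engine from many threads with multiple contexts is a borrow/return pool of (context, stream) pairs: each worker borrows a pair, runs inference on that pair's own stream, and returns it. A minimal sketch, with plain Python placeholders standing in for the real TensorRT objects:

```python
import queue

class ContextPool:
    """Borrow/return pool over the (context, stream) pairs built from one engine.

    The engine's weights are shared by all contexts; only per-context
    activation memory and streams are duplicated.
    """

    def __init__(self, pairs):
        self._q = queue.Queue()
        for pair in pairs:
            self._q.put(pair)

    def acquire(self, timeout=None):
        # Blocks until a (context, stream) pair is free.
        return self._q.get(timeout=timeout)

    def release(self, pair):
        self._q.put(pair)
```

Under the per-profile constraint discussed above, each pooled context would still need its own profile index, which is why duplicating profiles over the same shape range is the suggested workaround.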

Thank you.

Thank you for your reply. I still have one question: if I deserialize the same engine file to create multiple ICudaEngine instances, is there any way to share the weights between them? Since these instances are deserialized from the same engine file, they could share the same weights, which would reduce GPU memory consumption.

Hi @sneaxiy,

Looks like there is no ideal solution for this as of now. We recommend you try creating multiple profiles and assigning each profile to one execution context, instead of deserializing multiple CUDA engines.