I use TensorRT for multi-threaded online serving. My model is built with dynamic shape support. The optimization profiles are set up as follows: for profile index k in [0, N), min_dim = [2^(k-1) + 1, other_shape], opt_dim = [2^k, other_shape], max_dim = [2^k, other_shape], where other_shape is a fixed integer. In other words, profile k covers batch sizes up to 2^k.
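For concreteness, here is how I map an incoming batch size to a profile index under this scheme (a minimal sketch, assuming profile k has max batch 2^k; `profile_index` is my own helper name, not a TensorRT API):

```python
import math

def profile_index(batch: int, num_profiles: int) -> int:
    """Pick the smallest profile k whose max batch (2**k) covers `batch`.

    Assumes profile k has max_dim[0] == 2**k, as in the scheme above.
    """
    if batch < 1:
        raise ValueError("batch size must be >= 1")
    # Smallest k with 2**k >= batch, clamped to k >= 0.
    k = max(0, math.ceil(math.log2(batch)))
    if k >= num_profiles:
        raise ValueError(f"batch {batch} exceeds the largest profile (2**{num_profiles - 1})")
    return k
```

So with N = 4 profiles, batch sizes 1, 3, and 8 would be routed to profiles 0, 2, and 3 respectively.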
After reading the TensorRT documentation on dynamic-shape engines, I found that:
(1) IExecutionContext is not thread-safe.
(2) Different IExecutionContext objects created from the same ICudaEngine cannot share an optimization profile index; each IExecutionContext must use a distinct profile index.
(3) It is recommended that each IExecutionContext object use its own CUDA stream.
In conclusion, when I use the engine with dynamic shapes, I should:
(1) create N IExecutionContext objects.
(2) set the optimization profile to be 0,1,…N-1 for each IExecutionContext object.
(3) create N CUDA streams and assign one to each IExecutionContext object.
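The setup in steps (1)-(3) can be sketched as follows (a runnable sketch with stand-in classes; in real code `FakeContext` and `FakeStream` would be a TensorRT IExecutionContext and a CUDA stream, and `build_slots` is a hypothetical helper of mine):

```python
from dataclasses import dataclass
from typing import List

# Stand-ins for the real TensorRT / CUDA objects (assumption: the real code
# would call engine.create_execution_context() and create a CUDA stream).
@dataclass
class FakeContext:
    profile_index: int  # the optimization profile bound to this context

@dataclass
class FakeStream:
    stream_id: int

@dataclass
class Slot:
    context: FakeContext
    stream: FakeStream

def build_slots(num_profiles: int) -> List[Slot]:
    """One (context, stream) pair per optimization profile, one slot per k."""
    slots = []
    for k in range(num_profiles):
        ctx = FakeContext(profile_index=k)  # real code: set profile k on the context
        stream = FakeStream(stream_id=k)    # real code: create a dedicated CUDA stream
        slots.append(Slot(context=ctx, stream=stream))
    return slots
```

A request is then dispatched to the slot whose profile covers its input shape.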
My question is: if most request input shapes fall within the same optimization profile, then all of those requests are routed to a single IExecutionContext object and a single CUDA stream. They can only be processed one by one, not in parallel. How can I get more parallelism in this case? Is there any best practice?
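For concreteness, the serialization I am worried about can be sketched like this (a minimal sketch: each profile owns exactly one context, modeled as a lock, and `serve` with its `time.sleep` is a hypothetical stand-in for enqueue + synchronize):

```python
import threading
import time

# One lock per optimization profile: the single IExecutionContext bound to a
# profile can only serve one request at a time.
profile_locks = {k: threading.Lock() for k in range(4)}
completed = []

def serve(request_id: int, profile: int) -> None:
    with profile_locks[profile]:  # only one request per profile at a time
        time.sleep(0.01)          # stand-in for inference on that context
        completed.append((profile, request_id))

# Four concurrent requests, all mapping to the same hot profile (profile 1).
threads = [threading.Thread(target=serve, args=(i, 1)) for i in range(4)]
start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start  # roughly 4 x 0.01s: fully serialized
```

Even though the requests arrive concurrently, they finish in about four times the single-request latency, which is exactly the bottleneck I would like to avoid.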