I use TensorRT for multi-threaded online serving. My model is built with dynamic shape support, and the optimization profiles look like this: min_dim = [2^(k-1) + 1, other_shape] (min_dim = [1, other_shape] for k = 0), opt_dim = [2^k, other_shape], max_dim = [2^k, other_shape], where k is the profile index in [0, N) and other_shape is a fixed integer. (TensorRT requires min <= opt <= max, so profile k covers batch sizes in the range (2^(k-1), 2^k].)
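Under this layout (a sketch, assuming profile k covers batch sizes in (2^(k-1), 2^k], with profile 0 handling batch size 1), picking the profile for an incoming request reduces to a ceiling log2; `profile_index` is a hypothetical helper name, not a TensorRT API:

```python
def profile_index(batch_size: int) -> int:
    """Map a request's batch size to the optimization profile k whose
    range (2^(k-1), 2^k] contains it, i.e. k = ceil(log2(batch_size))."""
    if batch_size < 1:
        raise ValueError("batch size must be >= 1")
    # (b - 1).bit_length() equals ceil(log2(b)) for b >= 1
    return (batch_size - 1).bit_length()
```

For example, batch sizes 3 and 4 both land in profile 2 (range (2, 4]), while batch size 5 spills over into profile 3.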
After reading the TensorRT documentation on dynamic shape engines, I found that:
(1) IExecutionContext is not thread-safe.
(2) Different IExecutionContext objects created from the same ICudaEngine cannot share the same optimization profile at the same time; that is, each IExecutionContext must be assigned a different optimization profile index.
(3) Each IExecutionContext object is recommended to use its own CUDA stream.
In conclusion, when I use an engine with dynamic shapes, I should:
(1) create N IExecutionContext objects;
(2) set the optimization profile index to 0, 1, …, N-1, one per IExecutionContext object;
(3) create N CUDA streams and assign one to each IExecutionContext object.
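The three steps above amount to one worker per (context, stream, profile) triple, each draining its own request queue. Below is a plain-Python sketch of that serving structure; `run_inference` is a placeholder standing in for the actual TensorRT enqueue call, and all names here are assumptions rather than the TensorRT API:

```python
import queue
import threading

def make_workers(n_profiles, run_inference):
    """One queue + one worker thread per profile; each worker stands in
    for an IExecutionContext bound to profile k and its own CUDA stream."""
    queues = [queue.Queue() for _ in range(n_profiles)]

    def worker(k):
        while True:
            item = queues[k].get()
            if item is None:              # shutdown sentinel
                break
            request, results = item
            # requests on the same queue are serialized, mirroring the fact
            # that a single IExecutionContext processes one request at a time
            results.append(run_inference(k, request))

    threads = [threading.Thread(target=worker, args=(k,)) for k in range(n_profiles)]
    for t in threads:
        t.start()
    return queues, threads

def submit(queues, k, request, results):
    queues[k].put((request, results))

def shutdown(queues, threads):
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
```

Note that this structure makes the bottleneck in the question visible: two requests routed to the same profile k share one queue and run strictly one after the other.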
My question: if most of the requested input shapes fall within the same optimization profile, all of those requests will be processed by a single IExecutionContext object on a single CUDA stream, so they can only be processed one by one rather than in parallel. How can I solve this if I want more parallelism? Is there a best practice?
You can refer to the link below for the full list of supported operators; if an operator is not supported, you need to create a custom plugin for that operation.
Also, please share your model and script if you have not already, so that we can help you better.
Your understanding of execution contexts is correct. For now, we suggest creating multiple profiles with the same size, or using multiple engine instances.
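The first suggestion (multiple profiles with the same size) can be modeled as follows: at build time you add M duplicate copies of the hot profile, bind one context to each copy, and round-robin requests for that shape range across them. This is a plain-Python sketch of the dispatch side only; the builder-side profile duplication and all TensorRT calls are omitted, and the class name is hypothetical:

```python
import itertools

class DuplicateProfileDispatcher:
    """Round-robins requests that hit one hot profile across M duplicate
    copies of it, so M execution contexts can serve that shape range
    concurrently instead of queueing behind a single context."""

    def __init__(self, hot_profile: int, num_copies: int):
        self.hot_profile = hot_profile
        # cycles over copy slots 0..M-1; each slot maps to one context
        # bound to one duplicate of the hot profile
        self._cycle = itertools.cycle(range(num_copies))

    def pick_copy(self, profile: int) -> int:
        """Return which copy (0..M-1) should serve this request; profiles
        other than the hot one have only a single copy, slot 0."""
        if profile == self.hot_profile:
            return next(self._cycle)
        return 0
```

The trade-off is engine size and GPU memory: each duplicate profile adds its own tactic selections to the plan, but the weights remain shared within the one ICudaEngine, which is what makes this cheaper than deserializing multiple engines.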
Thanks! As you suggested, I can create multiple engine instances. Is there any way to share the weights between these different ICudaEngine objects?
Thank you for your reply. One more question: if I deserialize the same engine file multiple times to create multiple ICudaEngine instances, is there any way to share the weights between them? Since the instances come from the same engine file, sharing the weights would reduce GPU memory consumption.
It looks like there is no ideal solution for this right now. We recommend creating multiple profiles and assigning each profile to one of the instances, rather than deserializing multiple CUDA engines.