I read in the Seldon Core documentation that multi-model serving with overcommit is available out of the box on NVIDIA Triton:
https://docs.seldon.io/projects/seldon-core/en/v2/contents/models/mms/mms.html?highlight=multi%20modal%20serving
Can you please share documentation on how to configure and implement multi-model serving with overcommit using NVIDIA Triton?
Hi,
The link below might be useful for you.
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html
For multi-threading/streaming, we suggest using DeepStream or Triton.
For more details, we recommend raising the query in the DeepStream forum, or in the issues section of the Triton Inference Server GitHub repository.
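Note that in the Seldon Core v2 page you linked, overcommit is handled by the Seldon scheduler/agent layer sitting in front of Triton, not by Triton itself: each model declares the memory it needs, the server has a memory budget, and an overcommit margin lets more models be registered than fit at once, with idle models evicted and reloaded on demand. A rough sketch of what the resources might look like is below; the exact resource kinds, API versions, and field names are assumptions based on my reading of the Seldon Core v2 docs and should be verified against that page:

```yaml
# Hypothetical sketch (verify names/fields against the Seldon Core v2 docs):
# a Triton-backed Server, plus a Model that declares its memory footprint
# so the Seldon scheduler can pack and overcommit models onto the server.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: triton-mms
spec:
  serverConfig: triton   # use the Triton inference server runtime
  replicas: 1
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: simple-model
spec:
  storageUri: "gs://seldon-models/triton/simple"  # placeholder model location
  requirements:
    - tensorflow
  memory: 100Ki   # declared footprint the scheduler uses when packing models
```

The overcommit margin itself is typically a server/agent-level setting (e.g. an overcommit percentage on top of the server's memory budget) rather than something configured per model, so the Seldon Core v2 server configuration docs are the place to confirm the exact knob.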
Thanks!