TRTIS is a container that manages all inference models; can it manage containers?
In a containerized environment, a user can put their inference program in a container and deploy it directly. With TRTIS, the user instead has to spend additional effort integrating the inference program into TRTIS, which is much less convenient. I understand that TRTIS's current architecture is what gives it high throughput and GPU utilization. Is it possible to extend TRTIS to manage containers, so that the user does not need to make this additional integration effort?
I have been thinking about how to apply TRTIS's strengths (e.g. concurrent model execution) to containers. For example, a specialized Docker image could be provided to users (inference program developers). This image would contain specialized AI frameworks (e.g. a specialized TensorFlow, Caffe, ...) that map model execution instances onto CUDA streams for parallel computation, just as the TRTIS server does.
Consider a Kubernetes environment like the one in the attached figure: each of App1, App2, ... is a container that serves a certain kind of inference, with the inference model inside it. However, the underlying GPUs are not utilized to their full potential, so we'd like to introduce TRTIS and benefit from it. But the current design of TRTIS requires all models to be managed inside TRTIS itself. So, as in the first post above: can TRTIS be extended to manage containers, or how else can we leverage TRTIS's strengths (e.g. concurrent model execution) for containers? A sketch of the concurrent-execution setting we'd like to keep follows.
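For reference, the concurrent model execution we want to benefit from is configured per model in TRTIS through the `instance_group` setting in that model's `config.pbtxt`. A minimal sketch (the model name and platform here are hypothetical, just to illustrate the shape of the config):

```
name: "app1_model"
platform: "tensorflow_savedmodel"
max_batch_size: 8
instance_group [
  {
    # Run two execution instances of this model on the GPU,
    # so requests can be processed in parallel on CUDA streams.
    count: 2
    kind: KIND_GPU
  }
]
```

This is exactly the capability our per-app containers lose today, since each App container owns its model and its GPU context exclusively.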
If I'm understanding correctly, I think what you want to do to get maximum GPU utilization is to have app1 and app2 send requests to TRTIS so that TRTIS performs the inference for them. The app1 and app2 inference models would then need to be managed by TRTIS. This may or may not be something you can do.
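As a rough sketch of what that looks like from app1's side, assuming its model has been placed in the server's model repository: the app keeps its own business logic in its container and only delegates the GPU work over HTTP. The model name "app1_model", the tensor names "input"/"output", and the service hostname are all made up for illustration, and this uses the Python client library shipped with the server's current releases (TRTIS has since been renamed Triton Inference Server):

```python
# Sketch: an app container delegating inference to the shared server.
import numpy as np
import tritonclient.http as httpclient

# HTTP endpoint of the server, e.g. a Kubernetes Service in front of it.
client = httpclient.InferenceServerClient(url="trtis-service:8000")

# Build the request tensor from the app's own preprocessing.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# The server schedules this request alongside requests from other apps,
# running model instances concurrently on the shared GPUs.
result = client.infer(model_name="app1_model", inputs=[infer_input])
print(result.as_numpy("output").shape)
```

The app containers themselves then need no GPU access at all; only the server pod does, which is what lets the GPUs be shared and kept busy.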