Deploying a Natural Language Processing Service on a Kubernetes Cluster with Helm Charts from NVIDIA NGC

Originally published at:

Conversational AI solutions such as chatbots are now deployed in the data center, on the cloud, and at the edge to deliver lower latency and high quality of service while meeting an ever-increasing demand. The strategic decision to run AI inference on any or all these compute platforms varies not only by the use case…

Not directly relevant, but hopefully we can see an on-prem version of an image-based inference example, with multi-node autoscaling. There is not much how-to material about on-prem Kubernetes + Triton Inference Server with a Horizontal Pod Autoscaler.
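For anyone exploring this in the meantime, a Horizontal Pod Autoscaler can target a Triton Deployment on an on-prem cluster the same way it would in the cloud. Below is a minimal sketch scaling on CPU utilization; the Deployment name `triton-server` and the namespace `inference` are assumptions, and scaling on a GPU or queue-depth metric would additionally require a custom-metrics adapter such as prometheus-adapter:

```yaml
# Sketch: autoscale a Triton Inference Server Deployment between 1 and 4
# replicas based on average CPU utilization. Names below are assumed;
# adjust them to match your Helm release.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-server
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

This requires the metrics-server (or an equivalent metrics pipeline) to be installed in the cluster, which on-prem distributions do not always ship by default.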

The on-prem scenario is very relevant as well, although what we covered in the blog above should be mostly applicable to on-prem deployments too. Nevertheless, we'll consider it for future writing.


Thanks @jamess , I keep investigating possibilities for load balancing and autoscaling (particularly with KFServing/Kubernetes), but it seems the chances are slim for a no-cloud solution, one is
I would really appreciate it if you could provide some documentation on load balancing/scaling for local, multi-node Kubernetes installations. I believe this is a common case for many people who are trying to develop small-scale setups before moving to billed cloud services.
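One gap on bare-metal clusters is that Services of type LoadBalancer never get an external IP, because there is no cloud load balancer behind them; MetalLB is a common way to fill that gap. A minimal Layer 2-mode sketch follows, where the pool name and the address range are assumptions and should be replaced with unused addresses on your LAN:

```yaml
# Sketch: MetalLB in L2 mode hands out IPs from a local address pool to
# Services of type LoadBalancer on a bare-metal cluster. The address
# range below is an assumption; pick unused IPs on your own network.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: inference-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: inference-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - inference-pool
```

With this in place, an ingress controller or the Triton service itself can be exposed as a LoadBalancer Service and reached from other machines on the network, which covers the small-scale, pre-cloud setup described above.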

Thank you for your suggestions. We will see what we can do!