Horizontal Autoscaling of NVIDIA NIM Microservices on Kubernetes

Originally published at: Horizontal Autoscaling of NVIDIA NIM Microservices on Kubernetes | NVIDIA Technical Blog

NVIDIA NIM microservices are model inference containers that can be deployed on Kubernetes. In a production environment, it’s important to understand the compute and memory profile of these microservices to set up a successful autoscaling plan. In this post, we describe how to set up and use Kubernetes Horizontal Pod Autoscaling (HPA) with an NVIDIA…

Thank you, very useful. One question: I didn’t see any GPU assignment in your YAML. If there are not enough GPU resources on the same node, would this scale to another node with a GPU? What kind of parallelism would be used in this case?

Thanks for your question. The GPU assignment is in the NIM LLM deployment, which can be done using either our NIM Operator or Helm charts. When HPA triggers more instances of the pod, Kubernetes can deploy them on any node that satisfies the compute request, so not necessarily on the same node. The incoming request parallelism happens at the Kubernetes Service level (round robin), which dispatches across all instances of the pods.
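For readers wondering where the GPU request lives, here is a minimal sketch, assuming a Deployment named nim-llm and the gpu_cache_usage_perc metric discussed in the post. The deployment name, image placeholder, replica bounds, and target value are illustrative only; the actual pod spec comes from the NIM Operator or Helm chart.

```yaml
# Fragment of the NIM LLM pod spec (set via the NIM Operator or Helm chart):
# each replica requests one GPU, so the scheduler places new replicas on any
# node that still has a free GPU.
containers:
- name: nim-llm
  image: <nim-image>          # placeholder; comes from the chart
  resources:
    limits:
      nvidia.com/gpu: 1
---
# Hypothetical HPA scaling that Deployment on the custom metric exposed
# through prometheus-adapter.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nim-llm             # assumed Deployment name
  minReplicas: 1
  maxReplicas: 4              # illustrative bound
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_cache_usage_perc
      target:
        type: AverageValue
        averageValue: "0.5"   # illustrative target value
```

Whether a new replica lands on the same node or a different one is up to the scheduler; it only needs a node with an unallocated GPU, and the Service in front of the Deployment round-robins requests across all replicas.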

Hello, I’ve tried following your steps, but I can’t seem to exactly replicate your results.

Up until setting up the adapter, we can see gpu_cache_usage_perc in Prometheus, but running
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/*/gpu_cache_usage_perc"
does not work; it simply returns a 404 from the prometheus-adapter container, while all other metrics (e.g. DCGM metrics) do show up at the /apis/custom.metrics.k8s.io/v1beta1 endpoint.

From other sources on using HPA with custom metrics, we infer that a custom scraping config is required for prometheus-adapter, but we can’t seem to get it working. Can you please let us know if there are any steps we’re missing? Thanks in advance!
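(For anyone hitting the same 404: one possible cause is that prometheus-adapter has no rule mapping gpu_cache_usage_perc onto the pod resource. Below is a hypothetical sketch of a rules entry for the prometheus-adapter Helm chart values; the label names namespace and pod are assumptions about how the NIM metrics are scraped in your cluster and are not taken from the post.)

```yaml
# Hypothetical prometheus-adapter values.yaml excerpt: expose
# gpu_cache_usage_perc through the custom metrics API.
# Label names (namespace, pod) are assumptions about the scrape config.
rules:
  custom:
  - seriesQuery: 'gpu_cache_usage_perc{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^gpu_cache_usage_perc$"
      as: "gpu_cache_usage_perc"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

If a rule like this matches the series in Prometheus, the metric should then be listed by the kubectl get --raw query above once the adapter has picked up the updated config.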