What is the load balancer strategy for NIMs?

What type of load balancer does NVIDIA recommend, or have examples for, when fronting a pool of LLMs running in a K8s cluster and fielding API requests?

  1. Are the NIMs single-threaded, with one GPU (or set of GPUs) per NIM instance?
  2. What kind of load balancer or LB strategy handles that type of restriction?

Found this doc in another post. It describes using Traefik for Layer 7 load balancing of the speech services on K8s.
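
For context, a minimal Layer 7 route with Traefik on K8s looks roughly like the sketch below. It assumes Traefik's IngressRoute CRDs are installed in the cluster; the hostname, the backend Service name `nim-service`, and the port are placeholders, not anything from that doc.

```yaml
# Sketch only: a Traefik IngressRoute (requires the Traefik CRDs).
# Hostname, service name, and port are placeholders.
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: nim-route
spec:
  entryPoints:
    - web
  routes:
    - match: Host(`nim.example.com`) && PathPrefix(`/v1`)
      kind: Rule
      services:
        - name: nim-service   # backend Service in front of the NIM pods
          port: 8000
```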

What other docs should we look at?

So from the CPU side, yes, the NIMs are single-threaded, but the bulk of the work is being done on the GPU. The GPU can handle multiple requests in parallel using a technique called continuous/in-flight batching (this blog has a good explanation: Achieve 23x LLM Inference Throughput & Reduce p50 Latency).

That said, there is still a limit to the number of requests that a single GPU (or set of GPUs) can handle, but our recommendation for load balancing at the moment is to use your k8s provider's built-in LoadBalancer Service type. If your k8s provider doesn't offer a load balancer out of the box, I'd recommend looking at MetalLB: https://metallb.universe.tf/
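
As a rough illustration, a LoadBalancer Service fronting a pool of NIM pods might look like the sketch below. The name `nim-llm`, the `app: nim-llm` label, and port 8000 are assumptions; match them to your own Deployment.

```yaml
# Minimal sketch: a LoadBalancer Service spreading requests across NIM pods.
# Assumes the NIM pods are labeled app: nim-llm and listen on port 8000.
apiVersion: v1
kind: Service
metadata:
  name: nim-llm
spec:
  type: LoadBalancer      # provisioned by your k8s provider (or MetalLB on bare metal)
  selector:
    app: nim-llm          # must match the labels on your NIM pods
  ports:
    - name: http
      port: 80            # port exposed by the load balancer
      targetPort: 8000    # container port the NIM serves on
```

If you're on bare metal with MetalLB (v0.13+ CRD-style configuration), you'd also give it a pool of addresses to assign; the range below is purely an example:

```yaml
# Minimal sketch: MetalLB address pool plus an L2 advertisement.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: nim-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # example range; use free IPs on your network
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: nim-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - nim-pool
```

Note that a LoadBalancer Service typically balances at Layer 4 (per connection), which pairs well with the in-flight batching happening on each GPU; if you need request-level (Layer 7) routing, that's where something like Traefik comes in.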

You could take a look at the NIM Kubernetes Operator for some inspiration: https://github.com/NVIDIA/k8s-nim-operator (an operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment).
