So from the CPU side, yes, the NIMs are single-threaded, but the bulk of the work is done on the GPU. The GPU can handle multiple requests in parallel using a technique called continuous (in-flight) batching; this blog has a good explanation: Achieve 23x LLM Inference Throughput & Reduce p50 Latency.

That said, there is still a limit to the number of concurrent requests a single GPU (or set of GPUs) can handle, so at some point you'll want multiple replicas behind a load balancer. Our recommendation at the moment is to use your k8s provider's built-in LoadBalancer service type. If your k8s provider doesn't ship a load balancer out of the box, I'd recommend looking at MetalLB: https://metallb.universe.tf/
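
To see the batching in action from the client side, you can fire a bunch of requests concurrently and watch them complete together rather than one at a time. This is just a sketch: it assumes a NIM exposing the OpenAI-compatible API at localhost:8000 and serving meta/llama-3.1-8b-instruct, so adjust the URL and model name for your deployment:

```python
import concurrent.futures
import requests

# Assumption: a NIM endpoint exposing the OpenAI-compatible chat API.
NIM_URL = "http://localhost:8000/v1/chat/completions"

def ask(prompt: str) -> str:
    resp = requests.post(
        NIM_URL,
        json={
            "model": "meta/llama-3.1-8b-instruct",  # assumption: the model your NIM serves
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 64,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Send 16 requests at once; the server batches them in flight on the GPU,
# so total wall time is much less than 16 sequential calls.
prompts = [f"Give me one fact about GPUs, topic #{i}." for i in range(16)]
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80])
```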
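
And for the load-balancing piece, the LoadBalancer Service itself is plain Kubernetes. Here's a sketch using the official Python client (equivalent to a kubectl-applied manifest); the app: nim-llm label, namespace, and port 8000 are assumptions, so match them to your own NIM deployment:

```python
from kubernetes import client, config

# Uses your local kubeconfig; inside a cluster you'd call config.load_incluster_config().
config.load_kube_config()

# A Service of type LoadBalancer spreads traffic across all pods matching the selector.
svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="nim-llm"),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",  # provisioned by your k8s provider, or by MetalLB on bare metal
        selector={"app": "nim-llm"},  # assumption: your NIM pods carry this label
        ports=[client.V1ServicePort(port=8000, target_port=8000)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="default", body=svc)
```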