Model deployed on Kubernetes is unable to fully use the available GPU memory

Hello everyone.

I'm not sure if I opened this topic in the right place, but since it is related to deployment issues with one of the NIM models, I decided to post it here. If there is a more appropriate category for it, please help me move it there.

I am an Infrastructure and DevOps engineer at a startup, and our Data Science team wants to try one of your models, nvidia-nemotron-nano-9b-v2. Since all of our infrastructure runs on a Kubernetes cluster and the NIM models ship as containers, I decided to deploy it on our AWS EKS managed Kubernetes cluster, just like any other service.

For the deployment I am using NVIDIA's own Helm chart, nim-llm-1.15.3. I have also read the Model Card and checked the list of supported hardware for this particular model. The documentation says these are the supported GPUs: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100. So I filtered the AWS EC2 instance types that come with those GPUs and, naturally, started from the cheapest. I began with g5.4xlarge (1x A10G with 24 GB of GPU memory) and worked my way up to the largest instance with a compatible GPU, p4d.24xlarge (8x A100 with 40 GB of memory per GPU, 320 GB in total). But no matter which instance type I tried, I keep getting an error saying there is not enough available GPU memory on the host. This is the exact error from the container while running on the p4d.24xlarge instance:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 33.75 GiB. GPU 0 has a total capacity of 39.49 GiB of which 21.94 GiB is free. Including non-PyTorch memory, this process has 17.54 GiB memory in use. Of the allocated memory 17.06 GiB is allocated by PyTorch, and 1.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
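For reference, the GPU side of my Helm values is just the standard device-plugin resource request, and as far as I can tell the chart also exposes an env list that is passed through to the NIM container, which I could use to try suggestions like the PYTORCH_CUDA_ALLOC_CONF one from the error. Simplified (and assuming I am reading the chart's values.yaml correctly), the relevant part looks roughly like this, shown here for the p4d.24xlarge case:

```yaml
# Simplified excerpt of my nim-llm Helm values (everything else left at chart defaults).
resources:
  limits:
    nvidia.com/gpu: 8      # all 8 A100s of the p4d.24xlarge reserved for this pod
  requests:
    nvidia.com/gpu: 8

# Extra environment variables passed through to the NIM container,
# e.g. to test the allocator hint from the error message (not verified to help).
env:
  - name: PYTORCH_CUDA_ALLOC_CONF
    value: "expandable_segments:True"
```

On the smaller instances the GPU count in the limits/requests was adjusted to what the instance actually provides.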

I use Kubernetes affinities and tolerations to schedule the Pods onto the corresponding types of worker nodes (see the simplified config sketch below), so this p4d.24xlarge instance was dedicated entirely to this model. Besides the Pod running the model, there were only 5 other DaemonSet pods on the host: 4 of them infrastructure related (networking, monitoring, storage provider) and 1 the NVIDIA device plugin (which, as I have read, is necessary for proper GPU utilization in a Kubernetes environment). You can check the screenshots of the node specification and the pods running on it attached below.
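This is roughly what the scheduling part of my values looks like; the label and taint keys shown here are illustrative stand-ins for our own internal naming:

```yaml
# Simplified scheduling config (illustrative keys; our real labels/taints differ).
tolerations:
  - key: nvidia.com/gpu            # taint applied to our GPU node group
    operator: Exists
    effect: NoSchedule

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type   # standard EKS node label
              operator: In
              values:
                - p4d.24xlarge
```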

Please help me understand why the container cannot use the entire GPU memory. Could that missing ~17 GB be memory used by the device plugin pod, and if so, how can I limit its GPU memory usage to free up more for the model?