Model deployed on Kubernetes is unable to fully use the available GPU memory

Hello everyone.

I'm not sure if I opened this topic in the right place, but since it is related to deployment issues with one of the NIM models, I decided to post it here. If there is a more appropriate category for it, please help me move it there.

I am an Infrastructure and DevOps engineer at a startup, and our Data Science team wants to try one of your models, nvidia-nemotron-nano-9b-v2. Since all of our infrastructure runs on a Kubernetes cluster and the NIM models ship as containers, I decided to deploy it on our AWS EKS managed Kubernetes cluster, just like any other service.

For the deployment I am using NVIDIA's own Helm chart, nim-llm-1.15.3. I have also read the Model Card and checked the list of supported hardware for this particular model. The documentation says these are the supported GPUs: NVIDIA A10G, NVIDIA H100-80GB, NVIDIA A100. So I filtered the AWS EC2 instance types that come with those GPUs and, naturally, started from the cheapest. I began with g5.4xlarge (1x A10G with 24 GB of GPU memory) and worked my way up to the largest instance with a compatible GPU, p4d.24xlarge (8x A100 with 40 GB of memory per GPU, 320 GB in total). But no matter which instance type I tried, I keep getting an error saying there is not enough available GPU memory on the host. This is the exact error from the container while running on the p4d.24xlarge instance:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 33.75 GiB. GPU 0 has a total capacity of 39.49 GiB of which 21.94 GiB is free. Including non-PyTorch memory, this process has 17.54 GiB memory in use. Of the allocated memory 17.06 GiB is allocated by PyTorch, and 1.56 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
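For reference, the GPU side of my Helm values is just the standard device-plugin resource request, and as far as I can tell the chart also exposes an env list that is passed through to the NIM container, which I could use to try suggestions like the PYTORCH_CUDA_ALLOC_CONF one from the error. Simplified (and assuming I am reading the chart's values.yaml correctly), the relevant part looks roughly like this, shown here for the p4d.24xlarge case:

```yaml
# Simplified excerpt of my nim-llm Helm values (everything else left at chart defaults).
resources:
  limits:
    nvidia.com/gpu: 8      # all 8 A100s of the p4d.24xlarge reserved for this pod
  requests:
    nvidia.com/gpu: 8

# Extra environment variables passed through to the NIM container,
# e.g. to test the allocator hint from the error message (not verified to help).
env:
  - name: PYTORCH_CUDA_ALLOC_CONF
    value: "expandable_segments:True"
```

On the smaller instances the GPU count in the limits/requests was adjusted to what the instance actually provides.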

I use Kubernetes affinities and tolerations to schedule the Pods onto the corresponding types of worker nodes (see the simplified config sketch below), so this p4d.24xlarge instance was dedicated entirely to this model. Besides the Pod running the model, there were only 5 other DaemonSet pods on the host: 4 of them infrastructure related (networking, monitoring, storage provider) and 1 the NVIDIA device plugin (which, as I have read, is necessary for proper GPU utilization in a Kubernetes environment). You can check the screenshots of the node specification and the pods running on it attached below.
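This is roughly what the scheduling part of my values looks like; the label and taint keys shown here are illustrative stand-ins for our own internal naming:

```yaml
# Simplified scheduling config (illustrative keys; our real labels/taints differ).
tolerations:
  - key: nvidia.com/gpu            # taint applied to our GPU node group
    operator: Exists
    effect: NoSchedule

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type   # standard EKS node label
              operator: In
              values:
                - p4d.24xlarge
```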

Please help me understand why the container cannot use the entire GPU memory. Could that missing ~17 GB be memory used by the device plugin pod, and if so, how can I limit its GPU memory usage to free up more for the model?