Improving GPU Utilization in Kubernetes

Originally published at: https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/

To improve NVIDIA GPU utilization in K8s clusters, we offer new GPU time-slicing APIs, enabling multiple GPU-accelerated workloads to time-slice and run on a single NVIDIA GPU.

Hi all,

I was wondering if it is possible to enable time-slicing after having installed NVIDIA’s operator chart.

I have a fully-working k8s cluster with GPUs and I prefer not to “break” it. So I am trying the following:

I create a configmap containing the configuration:

kubectl create configmap time-slicing --from-file dp-example-config.yaml -n gpu-operator

(the same .yaml you use in your example)
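For readers without the blog post at hand, a time-slicing config of the kind passed to --from-file might look like the sketch below. The field layout follows the device plugin's documented sharing format; the replica count of 4 is an illustrative assumption, not necessarily the value used in the blog.

```shell
# Write an illustrative time-slicing config.
# Assumption: replicas=4 is a placeholder value chosen for this sketch.
cat > dp-example-config.yaml <<'EOF'
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4
EOF
```

Note that with --from-file, kubectl uses the file's base name (here dp-example-config.yaml) as the key inside the ConfigMap.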

Then I upgrade the operator release (after checking the structure of the chart’s Helm values to see how to apply the config):

helm upgrade --wait \
    -n gpu-operator \
    gpu-operator-1660041098 nvidia/gpu-operator \
    --set driver.enabled=false \
    --set devicePlugin.config.name=time-slicing

I can see the .yaml file mounted from the configmap in /available-configs in both the config-manager and nvidia-device-plugin pods of the StatefulSet. However, the time-slicing configuration is not applied.

I also noticed that the nvcr.io/nvidia/k8s-device-plugin:v0.12.2-ubi8 image is used for those pods instead of plain v0.12.2.

Am I missing something? Is any other approach available for running environments?

Many thanks in advance!

Sergio

Hi @sergio.garcia - thanks for reading the blog and your comment!

The gpu-operator Helm chart exposes a default setting under devicePlugin.config. To apply one config across the whole cluster, set devicePlugin.config.default=<config-name>, or in your case, devicePlugin.config.default=time-slicing. If no config is set as the default, you need to label the nodes so that they pick up the new plugin configuration.
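The per-node labeling route mentioned above can be sketched as follows. This assumes the nvidia.com/device-plugin.config node label that the device plugin's config manager watches; the helper name label_node_config and the node name in the usage comment are hypothetical.

```shell
# Hypothetical helper: select a device plugin config for a single node
# by setting the nvidia.com/device-plugin.config label on it.
label_node_config() {
    node="$1"      # node name, as listed by `kubectl get nodes`
    config="$2"    # name of the config that node should use
    kubectl label node "$node" "nvidia.com/device-plugin.config=$config" --overwrite
}

# Usage (gpu-node-1 is a placeholder):
#   label_node_config gpu-node-1 <config-name>
```

Labeling only some nodes is a low-risk way to try time-slicing on a running cluster, since unlabeled nodes keep their current behavior when no default is set.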

In the future, you can also file a GitHub issue on the gpu-operator project (NVIDIA/gpu-operator) for questions or issues.

Hi @P_Ramarao,

Many thanks for your response. I’ve been on vacation and hadn’t been able to try it.

I managed to enable time-slicing using the Operator’s chart. However, I think the devicePlugin.config.default value (which I had been leaving blank) must be the actual name of the .yaml file included in the ConfigMap (dp-example-config.yaml in my previous example). Don’t you agree?
In the chart’s values, this option is described as “# Default config name within the ConfigMap”.

Best,

Sergio
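Putting Sergio's finding together with his earlier command, the upgrade might look like the sketch below. It assumes, as he suggests and as the chart comment implies, that devicePlugin.config.default takes the key inside the ConfigMap rather than the ConfigMap's own name; this is not confirmed elsewhere in the thread. The helper name set_default_plugin_config is hypothetical.

```shell
# Hypothetical helper: upgrade the gpu-operator release so that one key of
# the ConfigMap becomes the cluster-wide default device plugin config.
set_default_plugin_config() {
    release="$1"       # Helm release name, e.g. gpu-operator-1660041098
    configmap="$2"     # ConfigMap name, e.g. time-slicing
    default_key="$3"   # key inside the ConfigMap, e.g. dp-example-config.yaml
    # driver.enabled=false is carried over from the earlier command in this thread.
    helm upgrade --wait \
        -n gpu-operator \
        "$release" nvidia/gpu-operator \
        --set driver.enabled=false \
        --set devicePlugin.config.name="$configmap" \
        --set devicePlugin.config.default="$default_key"
}

# Usage with the names from this thread:
#   set_default_plugin_config gpu-operator-1660041098 time-slicing dp-example-config.yaml
```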