Improving GPU Utilization in Kubernetes

Originally published at: https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/

To improve NVIDIA GPU utilization in K8s clusters, we offer new GPU time-slicing APIs, enabling multiple GPU-accelerated workloads to time-slice and run on a single NVIDIA GPU.

Hi all,

I was wondering if it is possible to enable time-slicing after having already installed NVIDIA’s GPU Operator chart.

I have a fully-working k8s cluster with GPUs and I prefer not to “break” it. So I am trying the following:

I create a configmap containing the configuration:

kubectl create configmap time-slicing --from-file dp-example-config.yaml -n gpu-operator

(the same .yaml you use in your example)
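For reference, that file follows the structure from the blog post, roughly like this (the replicas value here is just the illustrative one from the example):

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4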

Then I upgrade the operator release (after having checked its Helm values structure for applying the config):

helm upgrade --wait \
    -n gpu-operator \
    gpu-operator-1660041098 nvidia/gpu-operator \
    --set driver.enabled=false \
    --set devicePlugin.config.name=time-slicing

I can see the .yaml file from the ConfigMap mounted in /available-configs in both the config-manager and nvidia-device-plugin containers of the plugin pods. However, the time-slicing configuration is not applied.
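For completeness, this is roughly how I am checking (the pod name is just a placeholder):

kubectl exec -n gpu-operator <nvidia-device-plugin-pod> -c config-manager -- ls /available-configs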

I noticed that the nvcr.io/nvidia/k8s-device-plugin:v0.12.2-ubi8 image is used for those pods instead of just v0.12.2.

Am I missing something? Is there any other approach available for already-running environments?

Many thanks in advance!

Sergio

Hi @sergio.garcia - thanks for reading the blog and your comment!

The gpu-operator Helm chart lets you set a default config for the devicePlugin. To set a default config across the cluster, you would need to specify the parameter devicePlugin.config.default=<config-name>, or in your case, devicePlugin.config.default=time-slicing. If no config is set as default, then node labeling is required so that those nodes pick up the new plugin configuration.
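For the node-labeling path, a minimal sketch would be something like the following (assuming the nvidia.com/device-plugin.config label described in the blog, with the value set to the key/filename inside your ConfigMap; the node name is a placeholder):

kubectl label node <node-name> nvidia.com/device-plugin.config=dp-example-config.yaml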

Also, for future questions or issues, you can file a GitHub issue on the gpu-operator project at Issues · NVIDIA/gpu-operator · GitHub.

Hi @P_Ramarao,

Many thanks for your response. I’ve been on vacation and hadn’t been able to try it.

I managed to enable time-slicing using the Operator’s chart. However, I think the devicePlugin.config.default value (which I was leaving blank) must contain the actual name of the .yaml file included in the ConfigMap (dp-example-config.yaml in my previous example). Don’t you agree?
In the chart’s values this option is described as “# Default config name within the ConfigMap”.
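In case it helps others, the upgrade that worked for me looked roughly like this (same release and ConfigMap names as before; the point is that default refers to the key/filename inside the ConfigMap, not to the ConfigMap itself):

helm upgrade --wait \
    -n gpu-operator \
    gpu-operator-1660041098 nvidia/gpu-operator \
    --set driver.enabled=false \
    --set devicePlugin.config.name=time-slicing \
    --set devicePlugin.config.default=dp-example-config.yaml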

Best,

Sergio

Does Triton also provide the same oversubscription functionality as the device plugin? It is able to run multiple models on the same device concurrently, so it seems quite similar. What are the differences between the two approaches, and how should one choose between them?

Hi @P_Ramarao, does the replicas value in the time-slicing sharing configuration split the GPU memory equally? That is, for a 16 GB GPU, if I specify replicas: 2, does the memory get split in half, i.e. 8 GB for a single subscription by a Pod?

sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 2 

And if replicas is 5, would the memory be split into 16/5 GB for each subscription of the GPU?
It would be really great if you could clear this up.
Thanks

Hi @adn

Does Triton also provide the same oversubscription functionality as the device plugin? It is able to run multiple models on the same device concurrently

Triton does not use time-slicing for oversubscription. Triton does allow multiple models to be executed concurrently, but it uses the CUDA streams API to do so (i.e., each model is executed via a different CUDA stream, concurrently on the GPU). We also detailed CUDA streams in the blog, including the tradeoffs of using them.

Hope that helps.

Hi @sam137

does the replicas value in the time-slicing sharing configuration split the GPU memory equally

No, the time-slicing capability does not partition memory. Each process running on the GPU has full access to the whole GPU; only execution contexts are swapped in and out by the scheduler. As we mention in the blog, the tradeoff with using time-slicing is that you (as the application developer or devops engineer) need to make sure that one of the processes doesn’t end up allocating all the memory on the GPU, otherwise the other processes may run into OOM errors.

The time-slicing support in the device plugin simply provides an oversubscription model on the number of GPU devices available - so that two different containers can land on the same GPU and thus time-slice (from an execution perspective).
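As a minimal sketch (the pod name and CUDA image tag below are just placeholders): with replicas: 2 configured, a node with one physical GPU advertises two nvidia.com/gpu devices, so two pods like this one can both be scheduled onto it and end up time-slicing the same GPU:

apiVersion: v1
kind: Pod
metadata:
  name: time-slice-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:11.7.1-base-ubuntu20.04   # placeholder image tag
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/gpu: 1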

Hope that clarifies.


Thanks @P_Ramarao that clears it.

In the article "Improving GPU Utilization in Kubernetes" you mention the following:
“The tradeoffs with MPS are the limitations with error isolation, memory protection, and quality of service (QoS)”
However, table 1 shows that MPS supports memory protection. Could you please clarify which is correct?
Thank you in advance,
Manos

Hi @P_Ramarao

I’d like to understand time-slicing in terms of what we had before. Is time-slicing for Kubernetes equivalent to how 3D apps run on a regular desktop Linux system? That is, if we put each app in a separate container and launch everything on one node with time-slicing, will it look the same to the GPU as that Linux desktop?

@P_Ramarao
This explanation means that the total memory usage of multiple containers/processes cannot exceed the GPU’s memory limit, while each container/process can utilize all the SMs within its scheduled time slice.

That capability is not much different from multiple containers/processes simply sharing a single GPU directly. In that scenario, the memory usage of the processes also cannot exceed the GPU’s memory limit, but the thread blocks of each process’s kernels cannot utilize all the SMs while they are scheduled.

Often, we want to use time slicing so that a process/container can utilize all of the GPU’s SMs and memory within its time slice. Therefore, we would hope that when one process/container occupies a time slice, the memory of the other processes/containers could be migrated to CPU memory. Is this capability supported?