Run CUDA Application on Kubernetes

Hi, we have an H100 GPU cluster with CUDA 12.0 and another A100 cluster with CUDA 11.6 installed. We orchestrate the clusters with Kubernetes and provide our internal clients with a base image that has the CUDA toolkit installed, letting users build their training images on top of this base image.

We find that:

  • When we install CUDA toolkit 11.8 in our base image and run pods on the A100 cluster, nvidia-smi (we exec into the pod and run this command) shows the CUDA version as 11.8, which is expected.

  • However, when running pods on the H100 cluster with an image that has the CUDA 11.8 toolkit installed, nvidia-smi reports the CUDA version as 12.0 (we also exec into the pod and run the command).

I am a bit confused because I always thought the CUDA toolkit installed in our base image would override the toolkit provided by the NVIDIA container runtime.

nvidia-smi is part of the driver installation. The CUDA version reported by nvidia-smi is the maximum CUDA version supported by the installed driver. It does not indicate anything about which version of CUDA is installed (if any).

For example, on the machine I am typing this on, nvidia-smi reports 12.0 as the maximum supported CUDA version, while the latest version of CUDA actually installed is 11.8.
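If it helps, here is a minimal sketch (an illustration, not something specific to your cluster) that separates the two numbers from inside a pod, assuming the image has the CUDA toolkit and nvcc available to compile it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int runtimeVersion = 0, driverVersion = 0;

    // Version of the CUDA runtime the program is linked against,
    // i.e. the toolkit that ships in the container image.
    cudaRuntimeGetVersion(&runtimeVersion);

    // Maximum CUDA version supported by the installed driver;
    // this is the number nvidia-smi reports.
    cudaDriverGetVersion(&driverVersion);

    printf("Toolkit headers (CUDART_VERSION): %d\n", CUDART_VERSION);
    printf("Linked runtime version:           %d\n", runtimeVersion);
    printf("Driver supports up to:            %d\n", driverVersion);
    return 0;
}
```

On a machine like the one described above, the toolkit/runtime lines would show 11080 (11.8) while cudaDriverGetVersion would report 12000 (12.0), matching what nvidia-smi prints.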