Hi, we have an H100 GPU cluster with CUDA 12.0 and another A100 cluster with CUDA 11.6 installed. We orchestrate the clusters with Kubernetes and provide our internal clients a base image with the CUDA toolkit installed, letting users build their training images on top of this base image.
We find that:
- When we install CUDA toolkit 11.8 in our base image and run pods on the A100 cluster, nvidia-smi (we exec into the pod and run this command) shows the CUDA version as 11.8, which is expected.
- However, when we run pods on the H100 cluster with the same image (CUDA toolkit 11.8 installed), nvidia-smi (again run from inside the pod) shows the CUDA version as 12.0.
I am a bit confused because I always thought the CUDA toolkit installed in our base image would override the toolkit pulled in by the NVIDIA container runtime.
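
To narrow down which version is which, one check we could run inside a pod is a tiny CUDA program that prints both the runtime version linked into the binary (coming from the toolkit in the image) and the CUDA version the host driver reports it supports. This is just a minimal sketch, assuming the base image's toolkit provides nvcc and the file is saved as version_check.cu (both names are ours, for illustration):

```cuda
// version_check.cu -- distinguish the CUDA runtime bundled via the toolkit
// in the base image from the CUDA version supported by the host driver.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int runtime_version = 0;  // version of the CUDA runtime this binary was built against
    int driver_version  = 0;  // highest CUDA version the installed driver supports

    cudaRuntimeGetVersion(&runtime_version);
    cudaDriverGetVersion(&driver_version);

    // Versions are encoded as 1000 * major + 10 * minor (e.g. 11080 = 11.8).
    printf("CUDA runtime (toolkit) version: %d.%d\n",
           runtime_version / 1000, (runtime_version % 100) / 10);
    printf("CUDA driver-supported version:  %d.%d\n",
           driver_version / 1000, (driver_version % 100) / 10);
    return 0;
}
```

Compiling it inside the pod with `nvcc version_check.cu -o version_check` and running it, then comparing the two lines with what nvidia-smi prints on the same node, should make it clearer whether we are looking at the toolkit baked into the image or the version advertised by the host driver.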