Hi, we have an H100 GPU cluster with CUDA 12.0 and another A100 cluster with CUDA 11.6 installed. We orchestrate both clusters with Kubernetes and provide our internal clients a base image with the CUDA toolkit installed; users build their training images on top of this base image.
We find that when we install CUDA toolkit 11.8 in our base image and run pods on the A100 cluster, nvidia-smi (we exec into the pod and run the command there) reports CUDA version 11.8, which is what we expect.
However, when we run pods on the H100 cluster using the same image with the 11.8 toolkit installed, nvidia-smi (again run by exec'ing into the pod) reports CUDA version 12.0.
I am a bit confused, because I always thought the CUDA toolkit installed in our base image would override whatever the NVIDIA container runtime injects into the container.
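For context, this is roughly how we compare the two version numbers inside a pod (a sketch; the grep patterns are assumptions, and the commands only work on a node with a GPU and with the toolkit's nvcc on PATH):

```shell
# Run inside a pod scheduled on a GPU node.
# nvidia-smi's "CUDA Version" field is reported by the host driver that the
# NVIDIA container runtime exposes to the container, while nvcc reports the
# CUDA toolkit installed in the image itself -- so the two can differ.
nvidia-smi | grep "CUDA Version"   # driver-side CUDA version
nvcc --version | grep "release"    # toolkit version baked into the image
```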