Upgrading Nvidia DGX packages did not update CUDA version

Hello everyone,

We have a DGX box with A100 cards. We attempted to perform an upgrade on the system packages to move the DGX OS from 5.0.2 to the latest 5.4.2 version.

We followed the steps on the Nvidia documentation center to perform the upgrade via CLI:
dgx os upgrade notes

After rebooting the system, I can see that the Nvidia drivers were successfully updated from 450.80 to 450.216.04 and the DGX OS was also successfully updated.

However, Nvidia CUDA is still on 11.0, whereas I was expecting it to move up to 11.4 as per the release documentation (unless I misunderstood it).

Could someone please help us update CUDA on our DGX system as well? Do I need to switch to a newer Nvidia driver branch (e.g., 470.*) for this to happen?

Thank you very much in advance!

Hi @pedroDGX ,

Do you not see any other CUDA package versions available (via apt search or apt-cache policy for example)?
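For anyone who wants to script that check, here is a minimal sketch that wraps `apt-cache policy` and pulls out the installed and candidate versions. The package name `cuda-libraries-11-0` is only an assumption based on the naming pattern mentioned in this thread, and `parse_policy` and `policy_for` are hypothetical helper names, so adjust them to whatever your system actually lists.

```python
import re
import subprocess


def parse_policy(text):
    """Extract the Installed and Candidate versions from `apt-cache policy` output."""
    found = {}
    for line in text.splitlines():
        # Lines of interest look like "  Installed: <version>" or "  Candidate: <version>".
        match = re.match(r"\s*(Installed|Candidate):\s*(\S+)", line)
        if match:
            found[match.group(1).lower()] = match.group(2)
    return found


def policy_for(package):
    """Run `apt-cache policy <package>` and return the parsed versions."""
    result = subprocess.run(
        ["apt-cache", "policy", package],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_policy(result.stdout)


if __name__ == "__main__":
    # Assumed package name; replace with one that `apt search cuda` shows on your host.
    print(policy_for("cuda-libraries-11-0"))
```

If the installed and candidate versions match, apt considers that package up to date, and a newer CUDA release would have to be installed as a separate, differently named package.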

You may also consider switching your work to using NGC containers. The NGC CUDA containers are an excellent way to have a repeatable environment, and use the forward and backward compatibility of CUDA regardless of which driver versions are installed on the host.

ScottE


Hello @ScottEllis ,

Thank you very much for your response!

I do see other CUDA versions available when I do apt search (e.g., cuda-libraries-* all the way to version 12, but the one showing the [installed] flag is version 11-0).

We do run our ML/DL jobs using docker, and I just learned about the forward compatibility feature!

I was a bit confused before. I thought I needed to have a CUDA version on the host (DGX system) at least as high as that of the DL library I wanted to run on my container (e.g., if I have CUDA 11.0 on the host, I can only run PyTorch built against CUDA 11.0 or built against an earlier version of CUDA on the container).

I just grabbed a container with CUDA 11.6 and cudnn8, installed PyTorch 1.13 (built against CUDA 11.6) on top, and everything works well.
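A quick way to confirm a setup like this from inside the container is to ask PyTorch which CUDA version it was built against and whether it can see a GPU. This is only a sketch: `cuda_report` is a hypothetical helper name, and it assumes PyTorch may or may not be importable in the environment where it runs.

```python
def cuda_report():
    """Summarize the CUDA build and GPU visibility of the local PyTorch install."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version PyTorch was compiled against (None for CPU-only builds).
        "built_against_cuda": torch.version.cuda,
        # True only if the driver and at least one GPU are usable from here.
        "gpu_visible": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `gpu_visible` comes back False inside the container, the usual things to check first are whether the container was started with GPU access and whether the host driver is recent enough for the container's CUDA version.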

Following this example, I guess the forward compatibility means I can run PyTorch 1.13 (built against CUDA 11.6) on any container with CUDA 11.*, right?

Thank you very much!!

Exactly @pedroDGX ! The beauty of using containers in this model is that the host only needs to have the GPU driver installed. You can then run almost any version of CUDA inside a PyTorch container, because CUDA compatibility lets it work with older or newer drivers on the host.
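As a rough illustration of why the 450.216.04 driver is enough for a CUDA 11.6 container, here is a small sketch of the minor-version compatibility rule. The minimum driver versions in the table are my reading of NVIDIA's CUDA compatibility documentation for Linux, so please verify them against the current docs. The function names are hypothetical, and the check deliberately ignores the separate forward-compatibility package, which can widen what works on data center GPUs.

```python
# Minimum Linux driver version for CUDA minor-version compatibility, per CUDA major release.
# Assumption: values as listed in NVIDIA's CUDA compatibility documentation; verify before relying on them.
MIN_DRIVER_FOR_MAJOR = {
    11: (450, 80, 2),
    12: (525, 60, 13),
}


def parse_version(text):
    """Turn a dotted version such as '450.216.04' into a tuple of ints, padded to three parts."""
    parts = tuple(int(part) for part in text.strip().split("."))
    return parts + (0,) * (3 - len(parts))


def driver_covers_cuda_major(driver_version, cuda_major):
    """Return True if the driver meets the minor-version compatibility minimum for that CUDA major."""
    return parse_version(driver_version) >= MIN_DRIVER_FOR_MAJOR[cuda_major]


if __name__ == "__main__":
    # Driver version reported after the upgrade in this thread.
    print(driver_covers_cuda_major("450.216.04", 11))
    print(driver_covers_cuda_major("450.216.04", 12))
```

Under this rule a single driver covers every toolkit release within the same CUDA major version, although features that need a newer driver may still be unavailable.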

Most DGX users don’t even install CUDA in the host OS - that’s not used if work is done in containers. :-)

ScottE
