Bright 9.2 on Ubuntu: unable to update CUDA in one of the images

I manage a mid-size cluster (around 40 nodes) and have been asked to update the NVIDIA GPU drivers. I have done this before and it used to be an easy task. What I generally do is:

  1. On the head node: apt update, apt install cuda12.3-sdk cuda12.3-toolkit

  2. I clone the image I need to update and chroot into it, then I run:
    chrooted image: apt update; apt install cuda-driver cuda-dcgm

  3. I assign the new image to the category the node belongs to and reboot the node
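
Step 2 can be sketched as a small script. This is only a minimal sketch: the image path `/cm/images/gpu-image-2024` is an assumption (adjust it to wherever the cloned image actually lives), and `run`, `update_image`, `IMAGE_DIR` and `DRY_RUN` are names made up for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of step 2: update the driver packages inside a cloned image.
# IMAGE_DIR is an assumed location for the cloned image; adjust to your site.
IMAGE_DIR="${IMAGE_DIR:-/cm/images/gpu-image-2024}"
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Same two commands as in step 2, run inside the image via chroot.
update_image() {
  run chroot "$IMAGE_DIR" apt update
  run chroot "$IMAGE_DIR" apt install cuda-driver cuda-dcgm
}

update_image
```

Running it as is prints the two `chroot ... apt` commands; running it with `DRY_RUN=0` as root would actually execute them against the image.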

Well, this worked like a charm in the past, but now the installation process hangs. Please have a look:

Tue Jul 2 16:21:23 2024 [notice] hnode01: Initial ramdisk for node gnode01 based on image gpu-image-2024 was generated successfully
[hnode01->device]%
Tue Jul 2 16:25:46 2024 [notice] hnode01: gnode01 [ INSTALLING ] (node installer started)
[hnode01->device]%
Tue Jul 2 16:27:10 2024 [notice] hnode01: gnode01 [ INSTALLER_CALLINGINIT ] (switching to local root)
[hnode01->device]%
Tue Jul 2 16:37:10 2024 [notice] hnode01: gnode01 [ INSTALLER_UNREACHABLE ] (switching to local root)
[hnode01->device]%

The process never completes, and roles (e.g. the Slurm client) aren’t applied. If I switch back to the original image I built when I installed the cluster, the node comes up fine.
I have tried 5 times, and every attempt ends the same way.
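
To compare attempts, the node state changes can be pulled out of a saved copy of these messages. A minimal sketch, assuming the messages are laid out exactly as in the log above (time in the 4th field, node name in the 8th, state between `[` and `]`); `node_states` is a hypothetical helper name:

```shell
# Print time, node and state for each node-state message.
# Reads the files given as arguments, or standard input if none are given.
node_states() {
  awk '$9 == "[" && $11 == "]" { print $4, $8, $10 }' "$@"
}

# Example with two of the messages from the log above.
node_states <<'EOF'
Tue Jul 2 16:27:10 2024 [notice] hnode01: gnode01 [ INSTALLER_CALLINGINIT ] (switching to local root)
Tue Jul 2 16:37:10 2024 [notice] hnode01: gnode01 [ INSTALLER_UNREACHABLE ] (switching to local root)
EOF
```

For the two sample lines this prints `16:27:10 gnode01 INSTALLER_CALLINGINIT` and `16:37:10 gnode01 INSTALLER_UNREACHABLE`; lines that are not state messages, such as the ramdisk notice or the cmsh prompt, are skipped.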

Please note, these are the package versions I get from Bright:

root@gpu-image-2024:/# apt list -a cuda-driver cuda-dcgm
Listing… Done
cuda-dcgm/CM 9.2 1:3.1.3.1-198-cm9.2 amd64 [upgradable from: 1:3.1.3.1-172-cm9.2]
cuda-dcgm/CM 9.2,now 1:3.1.3.1-172-cm9.2 amd64 [installed,upgradable to: 1:3.1.3.1-198-cm9.2]
cuda-dcgm/CM 9.2 1:2.4.6.1-161-cm9.2 amd64
cuda-dcgm/CM 9.2 1:2.3.5.1-155-cm9.2 amd64
cuda-dcgm/CM 9.2 1:2.3.5.1-148-cm9.2 amd64
cuda-dcgm/CM 9.2 1:2.3.5.1-142-cm9.2 amd64

cuda-driver/CM 9.2 550.54.15-767-cm9.2 amd64 [upgradable from: 525.60.13-661-cm9.2]
cuda-driver/CM 9.2 535.129.03-738-cm9.2 amd64
cuda-driver/CM 9.2 530.30.02-711-cm9.2 amd64
cuda-driver/CM 9.2 530.30.02-682-cm9.2 amd64
cuda-driver/CM 9.2 525.85.12-665-cm9.2 amd64
cuda-driver/CM 9.2,now 525.60.13-661-cm9.2 amd64 [installed,upgradable to: 550.54.15-767-cm9.2]
cuda-driver/CM 9.2 520.61.05-640-cm9.2 amd64
cuda-driver/CM 9.2 515.65.01-638-cm9.2 amd64
cuda-driver/CM 9.2 515.65.01-636-cm9.2 amd64
cuda-driver/CM 9.2 515.43.04-609-cm9.2 amd64
cuda-driver/CM 9.2 510.47.03-600-cm9.2 amd64
cuda-driver/CM 9.2 510.39.01-595-cm9.2 amd64

root@gpu-image-2024:/#
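
A listing like this can be condensed to one line per package showing the installed and the newest version. A minimal sketch, assuming the field layout shown above (suite name `CM 9.2`, so the version is the third field) and that the newest version is listed first; `summarize_apt_list` is a hypothetical helper name:

```shell
# Summarize `apt list -a` output: installed and newest version per package.
# Assumes the layout of the listing above: "<pkg>/CM 9.2[,now] <version> ...",
# with the newest version of each package listed first.
summarize_apt_list() {
  awk '
    /^[^ ]+\// {
      split($1, parts, "/")                    # "cuda-driver/CM" -> "cuda-driver"
      pkg = parts[1]
      if ($2 ~ /,now$/) installed[pkg] = $3    # ",now" marks the installed version
      if (!(pkg in newest)) newest[pkg] = $3   # first line seen is the newest
    }
    END {
      for (pkg in newest)
        printf "%s installed=%s newest=%s\n", pkg, installed[pkg], newest[pkg]
    }
  ' | sort
}

# Usage, inside the chrooted image:
#   apt list -a cuda-driver cuda-dcgm 2>/dev/null | summarize_apt_list
```

The `Listing… Done` line is skipped because it contains no `package/suite` prefix.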

Hi Davide,

It's likely a good idea to send in a support request rather than posting on the forum. You can just cut and paste the text you’ve included here into the ticket.

Entering a case is easy: ESPCommunity

My teams don’t monitor or provide support here on the developer forum, but this is absolutely something we can assist with.

Cheers,
kw


Ken Woods
Worldwide Manager, Nvidia BCM Support
Direct: +31 61 185 8321
kwoods@nvidia.com

Thanks, I opened a case as suggested. We have another profile of nodes with AMD GPUs; I repeated the same process there and it worked as expected.