Driver 560.X fails to initialize H100 GPUs, but previous versions work fine

On one of our GPU nodes, the kernel module of driver version 560.X (I tried a few versions, including the latest 560.35.03-open and a few legacy/proprietary ones) fails to initialize the GPUs with:

[ 28.085281] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64 560.35.03 Release Build (dvs-builder@U16-I1-N07-12-3) Fri Aug 16 21:42:42 UTC 2024
[ 28.210035] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 560.35.03 Release Build (dvs-builder@U16-I1-N07-12-3) Fri Aug 16 21:22:33 UTC 2024
[ 28.215502] [drm] [nvidia-drm] [GPU ID 0x0000e300] Loading driver
[ 28.234728] NVRM: confComputeConstructEngine_IMPL: CPU does not support confidential compute.
[ 28.234731] NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ conf_compute.c:131
[ 28.234738] NVOC: __nvoc_objDelete: Child class Spdm not freed from parent class ConfidentialCompute.NVRM: osInitNvMapping: *** Cannot attach gpu
[ 28.234754] NVRM: RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[ 28.234773] NVRM: GPU 0000:e3:00.0: RmInitAdapter failed! (0x22:0x38:744)
[ 28.235769] NVRM: GPU 0000:e3:00.0: rm_init_adapter failed, device minor number 0
[ 28.236765] [drm:nv_drm_load [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x0000e300] Failed to allocate NvKmsKapiDevice
[ 28.251341] [drm:nv_drm_register_drm_device [nvidia_drm]] ERROR [nvidia-drm] [GPU ID 0x0000e300] Failed to register device

(This error repeats for all four H100s in the node.)

That has me dumbfounded for a number of reasons:

  • Driver versions 550 and 555 work perfectly fine on that machine.
  • We have two more identical machines (exact same hardware and software configuration), and on those, driver version 560 works perfectly fine.
  • We do not use and don’t plan to use the Confidential Compute (CC) feature. We have never knowingly tried to enable it, and in fact, gpu-admin-tools claims it is disabled as expected (output below), with either the 560 or 555 driver version running.

From the error message, it seems quite clear to me that an assertion fails at line 131 of src/nvidia/src/kernel/gpu/conf_compute/conf_compute.c in the open-gpu-kernel-modules source. And unless I'm reading the code wrong, that code path should only ever be entered when the driver thinks CC is enabled in hardware. So why does the 560 driver hallucinate that CC is enabled when it is not, on this machine and this machine only? Is there a parameter I can pass to the module to forcibly disable CC altogether?
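To make my reading of that path concrete, here is a tiny standalone sketch (my own hand-written model, not the actual driver code; every name except conf_compute.c is made up). The point is simply that the "CPU does not support confidential compute" branch can only be reached after the driver has already concluded that CC is enabled in hardware:

/* Standalone sketch of the failing decision logic as I read it.
 * This is NOT the actual conf_compute.c code; names are made up. */
#include <stdio.h>
#include <stdbool.h>

static bool cc_enabled_in_hw = true;   /* what the 560 driver apparently (mis)detects */
static bool cpu_supports_cc  = false;  /* our host CPUs have no CC mode enabled */

/* stands in for the engine-construction step seen in the log */
static int construct_conf_compute_engine(void)
{
    if (!cc_enabled_in_hw)
        return 0;  /* CC off in hardware: nothing to construct, init proceeds */

    /* Only reached when the driver believes CC is enabled in hardware. */
    if (!cpu_supports_cc) {
        fprintf(stderr, "CPU does not support confidential compute.\n");
        fprintf(stderr, "Assertion failed @ conf_compute.c:131\n");
        return -1;  /* construction fails */
    }
    return 0;  /* real CC setup would follow here */
}

int main(void)
{
    if (construct_conf_compute_engine() != 0)
        fprintf(stderr, "-> adapter init bails out, GPU not attached\n");
    return 0;
}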

The fact that it works on two identical machines makes me suspect the driver is reading uninitialized memory here. I also see that a new 'halified' version of gpuIsCCEnabledInHw was added in the 560 source, so my guess is that the problem lies there.
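For context on the 'halified' part: such queries are answered through a per-chip function pointer that gets bound when the GPU object is constructed. The following is only an illustration of that generic pattern (the per-chip functions are invented; only gpuIsCCEnabledInHw is a real identifier). Whether the driver reads an uninitialized value or picks a wrong binding, the visible effect would be the same: the query reports CC as enabled even though the hardware says otherwise.

/* Generic illustration of a HAL-dispatched query -- not NVIDIA's actual code.
 * The per-chip implementations below are hypothetical. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    bool (*isCCEnabledInHw)(void);  /* bound per chip/SKU at construction time */
} GpuHal;

static bool ccEnabled_readsFuses(void) { return false; }  /* what the HW reports */
static bool ccEnabled_wrongStub(void)  { return true;  }  /* a bad/stale binding */

int main(void)
{
    GpuHal good = { .isCCEnabledInHw = ccEnabled_readsFuses };
    GpuHal bad  = { .isCCEnabledInHw = ccEnabled_wrongStub  };  /* mis-bound entry */

    printf("good binding: CC enabled in HW = %d\n", good.isCCEnabledInHw());
    printf("bad  binding: CC enabled in HW = %d\n", bad.isCCEnabledInHw());
    /* With the bad binding, the driver would take the CC construction path
     * from the previous sketch and hit the failed assertion. */
    return 0;
}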

I'm aware that the recommended driver branch for the H100 is still 550.X, and for now I could downgrade to that (while also downgrading CUDA). However, I doubt this will get fixed in the 560 series before it becomes the production branch unless it's reported, which is why I'm reporting it here.

tg102-with-560-35-03-open-1.log.gz (212.0 KB)

./nvidia_gpu_tools.py --query-cc-settings --gpu 0

NVIDIA GPU Tools version v2024.08.09o
Command line arguments: ['./nvidia_gpu_tools.py', '--query-cc-settings', '--gpu', '0']
GPUs:
0 GPU 0000:03:00.0 H100-SXM 0x2330 BAR0 0x60042000000
1 GPU 0000:04:00.0 H100-SXM 0x2330 BAR0 0x5a042000000
2 GPU 0000:e3:00.0 H100-SXM 0x2330 BAR0 0x90042000000
3 GPU 0000:e4:00.0 H100-SXM 0x2330 BAR0 0x8a042000000
Other:
Topo:
PCI 0000:00:01.1 0x1022:0x14ab
PCI 0000:01:00.0 0x1000:0xc030
PCI 0000:02:00.0 0x1000:0xc030
GPU 0000:03:00.0 H100-SXM 0x2330 BAR0 0x60042000000
PCI 0000:02:01.0 0x1000:0xc030
GPU 0000:04:00.0 H100-SXM 0x2330 BAR0 0x5a042000000
PCI 0000:e0:01.1 0x1022:0x14ab
PCI 0000:e1:00.0 0x1000:0xc030
PCI 0000:e2:00.0 0x1000:0xc030
GPU 0000:e3:00.0 H100-SXM 0x2330 BAR0 0x90042000000
PCI 0000:e2:01.0 0x1000:0xc030
GPU 0000:e4:00.0 H100-SXM 0x2330 BAR0 0x8a042000000
2024-08-22,10:28:26.689 INFO Selected GPU 0000:03:00.0 H100-SXM 0x2330 BAR0 0x60042000000
2024-08-22,10:28:26.689 WARNING GPU 0000:03:00.0 H100-SXM 0x2330 BAR0 0x60042000000 has PPCIe mode on, some functionality may not work
2024-08-22,10:28:26.757 INFO GPU 0000:03:00.0 H100-SXM 0x2330 BAR0 0x60042000000 CC settings:
2024-08-22,10:28:26.757 INFO enable = 0
2024-08-22,10:28:26.757 INFO enable-devtools = 0
2024-08-22,10:28:26.757 INFO enable-bar0-filter = 0
2024-08-22,10:28:26.757 INFO enable-allow-inband-control = 1
2024-08-22,10:28:26.757 INFO enable-devtools-allow-inband-control = 1
2024-08-22,10:28:26.757 INFO enable-bar0-filter-allow-inband-control = 1

./nvidia_gpu_tools.py --query-cc-mode --gpu 0

NVIDIA GPU Tools version v2024.08.09o
Command line arguments: ['./nvidia_gpu_tools.py', '--query-cc-mode', '--gpu', '0']
GPUs:
0 GPU 0000:03:00.0 H100-SXM 0x2330 BAR0 0x60042000000
1 GPU 0000:04:00.0 H100-SXM 0x2330 BAR0 0x5a042000000
2 GPU 0000:e3:00.0 H100-SXM 0x2330 BAR0 0x90042000000
3 GPU 0000:e4:00.0 H100-SXM 0x2330 BAR0 0x8a042000000
Other:
Topo:
PCI 0000:00:01.1 0x1022:0x14ab
PCI 0000:01:00.0 0x1000:0xc030
PCI 0000:02:00.0 0x1000:0xc030
GPU 0000:03:00.0 H100-SXM 0x2330 BAR0 0x60042000000
PCI 0000:02:01.0 0x1000:0xc030
GPU 0000:04:00.0 H100-SXM 0x2330 BAR0 0x5a042000000
PCI 0000:e0:01.1 0x1022:0x14ab
PCI 0000:e1:00.0 0x1000:0xc030
PCI 0000:e2:00.0 0x1000:0xc030
GPU 0000:e3:00.0 H100-SXM 0x2330 BAR0 0x90042000000
PCI 0000:e2:01.0 0x1000:0xc030
GPU 0000:e4:00.0 H100-SXM 0x2330 BAR0 0x8a042000000
2024-08-22,10:28:58.460 INFO Selected GPU 0000:03:00.0 H100-SXM 0x2330 BAR0 0x60042000000
2024-08-22,10:28:58.460 WARNING GPU 0000:03:00.0 H100-SXM 0x2330 BAR0 0x60042000000 has PPCIe mode on, some functionality may not work
2024-08-22,10:28:58.460 INFO GPU 0000:03:00.0 H100-SXM 0x2330 BAR0 0x60042000000 CC mode is off

I run into the exact same issue with an 8x H100 HGX server, using the latest 560.35 driver and CUDA 12.6.

All the other NVIDIA components (fabric-manager, the persistence daemon) work just fine,

but nvidia-smi shows no GPUs,

and the same errors as shown above appear in the kernel log.