Broken GPU state / CC query failures with H100 on an AMD host

The GPU device repeatedly fails to be queried correctly from the host. I first observe this error:

nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings
NVIDIA GPU Tools version 535.86.06
2024-01-26,00:23:11.943 WARNING  GPU 0000:21:00.0 ? 0x2331 BAR0 0x3d042000000 not in D0 (current state 3), forcing it to D0
Topo:
  PCI 0000:20:01.1 0x1022:0x14ab
   GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000
2024-01-26,00:23:11.995 INFO     Selected GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000
2024-01-26,00:23:11.996 WARNING  GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000 has CC mode on, some functionality may not work
Traceback (most recent call last):
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2499, in <module>
    main()
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2473, in main
    cc_settings = gpu.query_cc_settings()
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2284, in query_cc_settings
    knob_value = self.fsp_rpc.prc_knob_read(knob_id)
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 1915, in prc_knob_read
    data = self.prc_cmd([prc])
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 1883, in prc_cmd
    self.poll_for_msg_queue()
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 1845, in poll_for_msg_queue
    raise GpuError(f"Timed out polling for {self.npu.name} message queue on channel {self.channel_num}. head {mhead} == tail {mtail}")
__main__.GpuError: Timed out polling for fsp message queue on channel 2. head 4294967295 == tail 4294967295
nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$ 

I try unbinding and rebinding the device, as indicated in the deployment guide, after starting the guest confidential VM and then shutting the guest CVM down. I can then query the GPU device:

nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings
NVIDIA GPU Tools version 535.86.06
Topo:
  PCI 0000:20:01.1 0x1022:0x14ab
   GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000
2024-01-26,00:27:56.041 INFO     Selected GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000
2024-01-26,00:27:56.041 WARNING  GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000 has CC mode on, some functionality may not work
2024-01-26,00:28:00.009 INFO     GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000 CC settings:
2024-01-26,00:28:00.009 INFO       enable = 1
2024-01-26,00:28:00.009 INFO       enable-devtools = 0
2024-01-26,00:28:00.009 INFO       enable-allow-inband-control = 1
2024-01-26,00:28:00.009 INFO       enable-devtools-allow-inband-control = 1
nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$ 

A few minutes after starting the guest CVM, I get this error on the host:

nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings
NVIDIA GPU Tools version 535.86.06
2024-01-26,00:31:00.759 WARNING  GPU 0000:21:00.0 ? 0x2331 BAR0 0x3d042000000 not in D0 (current state 3), forcing it to D0
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 127, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2092, in __init__
    raise BrokenGpuError()
2024-01-26,00:31:00.877 ERROR    GPU /sys/bus/pci/devices/0000:21:00.0 broken: 
2024-01-26,00:31:00.901 ERROR    Config space working True
Traceback (most recent call last):
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 127, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2092, in __init__
    raise BrokenGpuError()
__main__.BrokenGpuError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2499, in <module>
    main()
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2436, in main
    gpus, other = find_gpus()
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 146, in find_gpus
    return find_gpus_sysfs(bdf)
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 137, in find_gpus_sysfs
    dev = BrokenGpu(dev_path=dev_path)
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 1269, in __init__
    self.bars_configured = self.sanity_check_cfg_space_bars()
AttributeError: 'BrokenGpu' object has no attribute 'sanity_check_cfg_space_bars'. Did you mean: 'sanity_check_cfg_space'?
nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$

This happens far too often, and it is making the CVM hard to use.

You cannot perform these actions on the GPU device while it is bound to VFIO. You mentioned that unbinding/rebinding lets you query from the host, but then later you start to see issues on the host again.

Are you attempting to query the CC mode from the host side? That is not the intended usage model for CC, as the host/hypervisor layer is considered untrusted. Ongoing checks of CC mode should be done in the guest via:

$ nvidia-smi conf-compute -f
CC status: ON 

However, if you are checking for security reasons, the best practice is to modify the attestation scripts to periodically re-attest the GPU from within the guest. This flow should also be designed to work while running mission-mode code.
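
For example, a minimal sketch of such a periodic in-guest check could look like the following (the attestation-script path is only a placeholder for whatever verifier you already run; the "CC status: ON" string matches the nvidia-smi output above):

#!/usr/bin/env bash
# Re-check CC mode inside the guest every 10 minutes; exit non-zero if it drops.
while true; do
    if ! nvidia-smi conf-compute -f | grep -q "CC status: ON"; then
        echo "$(date -Is) CC mode is no longer ON" >&2
        exit 1
    fi
    # Placeholder: re-run your own guest attestation script here and bail on failure.
    # python3 /path/to/your/attestation_check.py || exit 1
    sleep 600
done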

Okay, I can continue checking the CC mode from within the guest environment instead of the host.

But in the guest environment, I often get these errors even though the CC mode on the host was okay before I started the guest VM:

nvidia@hccvm:~$ sudo nvidia-smi conf-compute -srs 1 
[sudo] password for nvidia: 
No devices were found
nvidia@hccvm:~$ sudo dmesg | grep nvidia
[    8.012261] nvidia: loading out-of-tree module taints kernel.
[    8.012863] nvidia: module license 'NVIDIA' taints kernel.
[    8.025948] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    8.104744] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[    8.121577] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  545.23.08  Mon Nov  6 23:23:07 UTC 2023
[    8.124168] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    8.124725] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[    9.041582] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[    9.075399] nvidia-uvm: Loaded the UVM driver, major device number 237.
[    9.184542] audit: type=1400 audit(1706233220.384:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=733 comm="apparmor_parser"
[    9.184545] audit: type=1400 audit(1706233220.384:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=733 comm="apparmor_parser"
[   11.035984] nvidia 0000:01:00.0: swiotlb buffer is full (sz: 368640 bytes), total 524288 (slots), used 693 (slots)
[   11.123591] nvidia 0000:01:00.0: swiotlb buffer is full (sz: 368640 bytes), total 524288 (slots), used 630 (slots)
[  501.304679] nvidia 0000:01:00.0: swiotlb buffer is full (sz: 368640 bytes), total 524288 (slots), used 619 (slots)
[  501.384436] nvidia 0000:01:00.0: swiotlb buffer is full (sz: 368640 bytes), total 524288 (slots), used 619 (slots)
nvidia@hccvm:~$ nvidia-smi conf-compute -f 
No devices were found
nvidia@hccvm:~$ 

You can attempt to reset the GPU (it might be locked down due to interference from the host while it was bound to the VM) with these steps:

(These are from memory as I’m not in front of a computer w/CVM+GPU available)

$ sudo su
$ echo 0000:21:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
$ echo 1 > /sys/bus/pci/devices/0000:21:00.0/reset
$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --set-cc-mode on --reset-after-cc-mode-switch
$ echo 0000:21:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
$ sudo ./launch_vm.sh

If this doesn't work, run sudo reboot on the host and try the steps to launch the CVM again, taking care not to toggle/query the GPU from the host while it is bound to VFIO.
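
Before touching the GPU from the host, one way to confirm what it is currently bound to is to read its sysfs driver link (BDF taken from your logs); if it reports vfio-pci, leave it alone until the CVM is shut down:

$ basename "$(readlink /sys/bus/pci/devices/0000:21:00.0/driver)"
vfio-pci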

[    8.121577] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  545.23.08  Mon Nov  6 23:23:07 UTC 2023

It seems that you installed NVIDIA driver version 545.23.08?

According to the CUDA 12.3 Update 2 Release Notes:

The Early Access (EA) of Hopper Confidential Computing is not enabled on 12.3 or its associated driver (545.xx). Please see https://docs.nvidia.com/confidential-computing/ for details.

Please follow https://docs.nvidia.com/confidential-computing-deployment-guide.pdf to install NVIDIA driver version 535.86.10 on the guest VM. Note that -m=kernel-open should be added when running the installer.
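
In other words, when installing from the runfile referenced in the guide, the command should look something like:

$ sudo sh cuda_12.2.1_535.86.10_linux.run -m=kernel-open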

Re-installing the right driver and resetting the device on the host worked only once. We continue to face other issues with the driver. Here are some kernel logs; do you see why this could be happening?

Logs observed when trying to use nvidia-smi:

Jan 31 06:40:08 hccvm systemd-timesyncd[724]: Timed out waiting for reply from 185.125.190.56:123 (ntp.ubuntu.com).
Jan 31 06:40:09 hccvm systemd-resolved[765]: Clock change detected. Flushing caches.
Jan 31 06:40:09 hccvm systemd-timesyncd[724]: Initial synchronization to time server 185.125.190.58:123 (ntp.ubuntu.com).
Jan 31 06:40:39 hccvm kernel: [   92.112758] ACPI Warning: \_SB.PCI0.S20.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20220331/nsarguments-61)
Jan 31 06:40:43 hccvm kernel: [   96.318265] nvidia-uvm: Loaded the UVM driver, major device number 237.
Jan 31 06:40:43 hccvm kernel: [   96.367798] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1e00005
Jan 31 06:40:55 hccvm kernel: [  107.952175] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
Jan 31 06:40:55 hccvm kernel: [  107.952200] NVRM osInitNvMapping: *** Cannot attach gpu
Jan 31 06:40:55 hccvm kernel: [  107.952202] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
Jan 31 06:40:55 hccvm kernel: [  107.952223] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
Jan 31 06:40:55 hccvm kernel: [  107.953483] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jan 31 06:40:55 hccvm kernel: [  108.025939] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
Jan 31 06:40:55 hccvm kernel: [  108.025962] NVRM osInitNvMapping: *** Cannot attach gpu
Jan 31 06:40:55 hccvm kernel: [  108.025964] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
Jan 31 06:40:55 hccvm kernel: [  108.025985] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
Jan 31 06:40:55 hccvm kernel: [  108.027166] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

Logs seen when trying to start nvidia-persistenced service:

Jan 31 06:55:00 hccvm nvidia-persistenced: Started (25254)
Jan 31 06:55:01 hccvm kernel: [  953.720362] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
Jan 31 06:55:01 hccvm kernel: [  953.720369] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:2331)
Jan 31 06:55:01 hccvm kernel: [  953.720369] NVRM: installed in this system is not supported by the
Jan 31 06:55:01 hccvm kernel: [  953.720369] NVRM: NVIDIA 545.23.08 driver release.
Jan 31 06:55:01 hccvm kernel: [  953.720369] NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
Jan 31 06:55:01 hccvm kernel: [  953.720369] NVRM: in this release's README, available on the operating system
Jan 31 06:55:01 hccvm kernel: [  953.720369] NVRM: specific graphics driver download page at www.nvidia.com.
Jan 31 06:55:01 hccvm kernel: [  953.725420] nvidia: probe of 0000:01:00.0 failed with error -1
Jan 31 06:55:01 hccvm kernel: [  953.725439] NVRM: The NVIDIA probe routine failed for 1 device(s).
Jan 31 06:55:01 hccvm kernel: [  953.725987] NVRM: None of the NVIDIA devices were initialized.
Jan 31 06:55:01 hccvm kernel: [  953.726692] nvidia-nvlink: Unregistered Nvlink Core, major device number 239
Jan 31 06:55:01 hccvm systemd-udevd[24634]: nvidia: Process '/sbin/modprobe nvidia-uvm' failed with exit code 1.

The driver is still 545.23.08. Consider uninstalling this driver version in the guest.
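
Depending on how 545.23.08 was installed in the guest, the removal would be something like:

# For a runfile (.run) install:
$ sudo /usr/bin/nvidia-uninstall
# For an Ubuntu packaged install:
$ sudo apt purge 'nvidia-*' 'libnvidia-*' && sudo apt autoremove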

I was unable to start nvidia-persistenced or run nvidia-smi; I realized this was due to the incorrect driver version that had been installed. I re-installed the correct version of the driver and toolkit as mentioned in the deployment guide (cuda_12.2.1_535.86.10_linux.run).
I was able to run nvidia-persistenced after that, but when I run nvidia-smi conf-compute -srs 1, I get an error indicating that no devices were found. When I check the dmesg logs, this is what I see:

[  474.757140] nvidia-uvm: Loaded the UVM driver, major device number 237.
[  474.762333] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  535.86.10  Release Build  (dvs-builder@U16-I2-C05-14-2)  Wed Jul 26 23:05:16 UTC 2023
[  474.763534] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[  474.763536] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[  474.766346] [drm] [nvidia-drm] [GPU ID 0x00000100] Unloading driver
[  474.812228] nvidia-modeset: Unloading
[  474.957832] nvidia-uvm: Unloaded the UVM driver.
[  474.985534] nvidia-nvlink: Unregistered Nvlink Core, major device number 239
[  481.375587] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[  481.375592] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  535.86.10  Release Build  (dvs-builder@U16-I2-C05-14-2)  Wed Jul 26 23:15:31 UTC 2023
[  481.388372] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  535.86.10  Release Build  (dvs-builder@U16-I2-C05-14-2)  Wed Jul 26 23:05:16 UTC 2023
[  481.389479] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[  481.389481] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[  568.791218] ACPI Warning: \_SB.PCI0.S20.S00._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20220331/nsarguments-61)
[  572.875011] nvidia-uvm: Loaded the UVM driver, major device number 237.
[  573.041217] NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1e00005
[  687.029854] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[  687.029872] NVRM osInitNvMapping: *** Cannot attach gpu
[  687.029874] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[  687.029892] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[  687.031062] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  687.104653] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[  687.104672] NVRM osInitNvMapping: *** Cannot attach gpu
[  687.104675] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[  687.104691] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[  687.105864] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  698.965218] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[  698.965238] NVRM osInitNvMapping: *** Cannot attach gpu
[  698.965239] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[  698.965260] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[  698.966587] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  699.052822] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[  699.052843] NVRM osInitNvMapping: *** Cannot attach gpu
[  699.052845] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[  699.052866] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[  699.054097] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  986.377921] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[  986.377939] NVRM osInitNvMapping: *** Cannot attach gpu
[  986.377941] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[  986.377959] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[  986.379375] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  986.453714] NVRM gpumgrCheckRmFirmwarePolicy: Disabling GSP offload -- GPU not supported
[  986.453728] NVRM osInitNvMapping: *** Cannot attach gpu
[  986.453730] NVRM RmInitAdapter: osInitNvMapping failed, bailing out of RmInitAdapter
[  986.453742] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x56:631)
[  986.454857] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

@rnertney

The supported driver version is 535.104.05.

After we go GA, every enterprise-recommended driver will support CC modes. Please perform these steps:

# On the Guest:

$ wget https://developer.download.nvidia.com/compute/cuda/12.2.2/local_installers/cuda_12.2.2_535.104.05_linux.run

# The runfile wants a few prerequisites
$ sudo apt install gcc g++ make

# Install the driver. Accept the EULA, and accept the default options
$ sudo sh cuda_12.2.2_535.104.05_linux.run -m=kernel-open

Please use gpu_cc_tool.py on the host to check that the CC status is on (enable = 1).
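
For example, after shutting down the CVM and unbinding the GPU from vfio-pci (per the earlier replies), run the following on the host and confirm the output shows enable = 1:

$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings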
