Broken GPU after PCI unbind hang (H100 NVL, CC mode)

Hi,

UPDATE: this problem was fixed by a cold reboot.


I'm using an H100 NVL GPU; the kernel is 6.2.0-mvp10v1+8-generic #mvp10v1+tdx, and CC mode was previously enabled and tested successfully. But recently, when I tried to unbind_pci (after turning CC mode on in devtools), it did not respond for a long time (it hangs at PCI operations, e.g., bind, remove), so I rebooted the system.
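
For context, these are the gpu-admin-tools invocations involved (the same command lines that appear in the output below):

# query the current CC mode
./nvidia_gpu_tools.py --gpu-bdf=5a:00.0 --query-cc-mode
# switch CC mode to devtools and reset the GPU afterwards
./nvidia_gpu_tools.py --gpu-bdf=5a:00.0 --set-cc-mode=devtools --reset-after-cc-mode-switch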

After the reboot, nvidia-smi shows the right thing and I can run applications in normal mode. But when I try to query or turn on CC mode, it shows:

NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu-bdf=5a:00.0', '--query-cc-mode']
  File "/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 3665, in __init__
    self.bar1 = self._map_bar(1, 1024 * 1024)
  File "/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 1577, in _map_bar
    return FileMap("/dev/mem", bar_addr, bar_size)
  File "/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 193, in __init__
    mapped = mmap.mmap(f.fileno(), size, mmap.MAP_SHARED, prot, offset=offset)
2024-10-24,17:32:18.984 ERROR    GPU /sys/bus/pci/devices/0000:5a:00.0 broken: [Errno 11] Resource temporarily unavailable
2024-10-24,17:32:18.986 ERROR    Config space working True
Topo:
  Intel root port 0000:59:01.0
   GPU 0000:5a:00.0 ? 0x2321 BAR0 0x9e042000000
   GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1]
2024-10-24,17:32:18.986 INFO     Selected GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1]
GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1] is broken ...
2024-10-24,17:32:18.986 ERROR    GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1] is broken and --recover-broken-gpu was not specified, returning failure.
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu-bdf=5a:00.0', '--set-cc-mode=devtools', '--reset-after-cc-mode-switch']
  File "/p/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "/p/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 3665, in __init__
    self.bar1 = self._map_bar(1, 1024 * 1024)
  File "/p/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 1577, in _map_bar
    return FileMap("/dev/mem", bar_addr, bar_size)
  File "/p/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 193, in __init__
    mapped = mmap.mmap(f.fileno(), size, mmap.MAP_SHARED, prot, offset=offset)
2024-10-24,17:44:39.670 ERROR    GPU /sys/bus/pci/devices/0000:5a:00.0 broken: [Errno 11] Resource temporarily unavailable
2024-10-24,17:44:39.673 ERROR    Config space working True
Topo:
  Intel root port 0000:59:01.0
   GPU 0000:5a:00.0 ? 0x2321 BAR0 0x9e042000000
   GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1]
2024-10-24,17:44:39.673 INFO     Selected GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1]
GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1] is broken ...
2024-10-24,17:44:39.673 ERROR    GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1] is broken and --recover-broken-gpu was not specified, returning failure.

I tried the --recover-broken-gpu option, but it gets stuck again in main():

                # Reset the GPU with SBR and if successful,
                # remove and rescan it to recover BARs
                if gpu.reset_with_sbr():
                    gpu.sysfs_remove() # <------- stuck here, no response for a long time
                    sysfs_pci_rescan()
                    gpu.reinit()
                    if gpu.is_broken_gpu():
                        error("Failed to recover %s", gpu)
                        sys.exit(1)
                    else:
                        info("Recovered %s", gpu)

And this is the unbind_pci script I'm using:

# the GPU with CC mode on
gpu="0000:5a:00.0"
# vendor/device ID pair, needed for vfio-pci's new_id
gpu_vd="$(cat /sys/bus/pci/devices/$gpu/vendor) $(cat /sys/bus/pci/devices/$gpu/device)"
# detach the GPU from its current driver
echo "$gpu" > "/sys/bus/pci/devices/$gpu/driver/unbind"
# let vfio-pci claim the device
echo "$gpu_vd" > /sys/bus/pci/drivers/vfio-pci/new_id

And there is no related information in dmesg.
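
For reference, the bind direction (also one of the operations that hangs, as mentioned above) goes through the same sysfs interfaces; a rough sketch, not my exact script:

gpu="0000:5a:00.0"
gpu_vd="$(cat /sys/bus/pci/devices/$gpu/vendor) $(cat /sys/bus/pci/devices/$gpu/device)"
# check which driver currently owns the GPU
readlink "/sys/bus/pci/devices/$gpu/driver"
# detach from vfio-pci and drop the ID so it is not re-claimed
echo "$gpu" > /sys/bus/pci/drivers/vfio-pci/unbind
echo "$gpu_vd" > /sys/bus/pci/drivers/vfio-pci/remove_id
# ask the kernel to probe the default driver again
echo "$gpu" > /sys/bus/pci/drivers_probe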

When I try a PCI rescan, the kernel logs:

[ 2814.304147] pcieport 0000:48:01.0: bridge window [io  0x1000-0x0fff] to [bus 49] add_size 1000
[ 2814.304158] pcieport 0000:48:03.0: bridge window [io  0x1000-0x0fff] to [bus 4a] add_size 1000
[ 2814.304162] pcieport 0000:48:05.0: bridge window [io  0x1000-0x0fff] to [bus 4b] add_size 1000
[ 2814.304173] pcieport 0000:48:01.0: BAR 13: no space for [io  size 0x1000]
[ 2814.304176] pcieport 0000:48:01.0: BAR 13: failed to assign [io  size 0x1000]
[ 2814.304179] pcieport 0000:48:03.0: BAR 13: no space for [io  size 0x1000]
[ 2814.304180] pcieport 0000:48:03.0: BAR 13: failed to assign [io  size 0x1000]
[ 2814.304183] pcieport 0000:48:05.0: BAR 13: no space for [io  size 0x1000]
[ 2814.304184] pcieport 0000:48:05.0: BAR 13: failed to assign [io  size 0x1000]
[ 2814.304189] pcieport 0000:48:05.0: BAR 13: no space for [io  size 0x1000]
[ 2814.304191] pcieport 0000:48:05.0: BAR 13: failed to assign [io  size 0x1000]
[ 2814.304193] pcieport 0000:48:03.0: BAR 13: no space for [io  size 0x1000]
[ 2814.304194] pcieport 0000:48:03.0: BAR 13: failed to assign [io  size 0x1000]
[ 2814.304196] pcieport 0000:48:01.0: BAR 13: no space for [io  size 0x1000]
[ 2814.304198] pcieport 0000:48:01.0: BAR 13: failed to assign [io  size 0x1000]

But I don't think that is related to this GPU: those bridges (0000:48:0x.0, buses 49/4a/4b) sit under a different root port than the GPU (0000:59:01.0 -> 0000:5a:00.0).
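
In case it helps, a quick way to sanity-check that (standard lspci/sysfs queries, nothing GPU-specific):

# tree view: the 0000:48:0x.0 bridges lead to buses 49/4a/4b,
# while the GPU sits under root port 0000:59:01.0 on bus 5a
lspci -t
# BAR assignments of the GPU itself, independent of those bridge I/O window messages
cat /sys/bus/pci/devices/0000:5a:00.0/resource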

I've rebooted a few times, but that does not restore the correct GPU configuration. I also tried an FLR; after that, PCI operations still hang.
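
For reference, an FLR can be triggered through the standard sysfs reset interface; a minimal sketch (reset_method is the newer attribute that lists the available reset types and should exist on this 6.2 kernel):

gpu="0000:5a:00.0"
# list the reset methods the kernel supports for this device (flr, bus, ...)
cat /sys/bus/pci/devices/$gpu/reset_method
# prefer FLR, then trigger the reset; in my case PCI operations still hang afterwards
echo flr > /sys/bus/pci/devices/$gpu/reset_method
echo 1 > /sys/bus/pci/devices/$gpu/reset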

Therefore, I'm wondering: is there a way to restore/flush the GPU state back to normal?

Thanks for any possible answers.

Thanks for the update!