Hi,
UPDATE
This problem was fixed by a cold reboot (full power cycle).
I’m using an H100 NVL GPU, kernel 6.2.0-mvp10v1+8-generic #mvp10v1+tdx,
and CC mode had previously been enabled and tested successfully. But recently, when I ran my unbind_pci script (after turning CC mode on to devtools), it did not respond for a long time (it hangs on PCI operations such as bind and remove), so I rebooted the system.
After the reboot, nvidia-smi shows the expected output and I can run applications in normal mode. But when I try to query or turn on CC mode, the tool reports:
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu-bdf=5a:00.0', '--query-cc-mode']
File "/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
dev = Gpu(dev_path=dev_path)
File "/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 3665, in __init__
self.bar1 = self._map_bar(1, 1024 * 1024)
File "/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 1577, in _map_bar
return FileMap("/dev/mem", bar_addr, bar_size)
File "/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 193, in __init__
mapped = mmap.mmap(f.fileno(), size, mmap.MAP_SHARED, prot, offset=offset)
2024-10-24,17:32:18.984 ERROR GPU /sys/bus/pci/devices/0000:5a:00.0 broken: [Errno 11] Resource temporarily unavailable
2024-10-24,17:32:18.986 ERROR Config space working True
Topo:
Intel root port 0000:59:01.0
GPU 0000:5a:00.0 ? 0x2321 BAR0 0x9e042000000
GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1]
2024-10-24,17:32:18.986 INFO Selected GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1]
GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1] is broken ...
2024-10-24,17:32:18.986 ERROR GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1] is broken and --recover-broken-gpu was not specified, returning failure.
NVIDIA GPU Tools version v2024.02.14o
Command line arguments: ['./nvidia_gpu_tools.py', '--gpu-bdf=5a:00.0', '--set-cc-mode=devtools', '--reset-after-cc-mode-switch']
File "/p/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 110, in find_gpus_sysfs
dev = Gpu(dev_path=dev_path)
File "/p/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 3665, in __init__
self.bar1 = self._map_bar(1, 1024 * 1024)
File "/p/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 1577, in _map_bar
return FileMap("/dev/mem", bar_addr, bar_size)
File "/p/confidentialgpu/hcc/gpu-admin-tools/./nvidia_gpu_tools.py", line 193, in __init__
mapped = mmap.mmap(f.fileno(), size, mmap.MAP_SHARED, prot, offset=offset)
2024-10-24,17:44:39.670 ERROR GPU /sys/bus/pci/devices/0000:5a:00.0 broken: [Errno 11] Resource temporarily unavailable
2024-10-24,17:44:39.673 ERROR Config space working True
Topo:
Intel root port 0000:59:01.0
GPU 0000:5a:00.0 ? 0x2321 BAR0 0x9e042000000
GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1]
2024-10-24,17:44:39.673 INFO Selected GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1]
GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1] is broken ...
2024-10-24,17:44:39.673 ERROR GPU 0000:5a:00.0 [broken, cfg space working 1 bars configured 1] is broken and --recover-broken-gpu was not specified, returning failure.
I tried the --recover-broken-gpu option, but it gets stuck again, this time inside main():
# Reset the GPU with SBR and if successful,
# remove and rescan it to recover BARs
if gpu.reset_with_sbr():
    gpu.sysfs_remove() # <------- stuck here, no response for a long time
    sysfs_pci_rescan()
    gpu.reinit()
    if gpu.is_broken_gpu():
        error("Failed to recover %s", gpu)
        sys.exit(1)
    else:
        info("Recovered %s", gpu)
And this is the unbind_pci script I’m using:
# the GPU with CC on
gpu="0000:5a:00.0"
gpu_vd="$(cat /sys/bus/pci/devices/$gpu/vendor) $(cat /sys/bus/pci/devices/$gpu/device)"
# unbind from the current driver
echo "$gpu" > "/sys/bus/pci/devices/$gpu/driver/unbind"
# register the vendor/device ID with vfio-pci so it can claim the GPU
echo "$gpu_vd" > /sys/bus/pci/drivers/vfio-pci/new_id
And there is no related information in dmesg.
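In case it helps isolate the problem, an alternative way to hand the device to vfio-pci would be the driver_override interface instead of new_id (a sketch only; I don't expect it to avoid the hang by itself, it just takes ID matching out of the picture):

gpu="0000:5a:00.0"
# pin this specific device to vfio-pci
echo vfio-pci > "/sys/bus/pci/devices/$gpu/driver_override"
# unbind from the current driver, if one is bound
if [ -e "/sys/bus/pci/devices/$gpu/driver" ]; then
    echo "$gpu" > "/sys/bus/pci/devices/$gpu/driver/unbind"
fi
# ask the PCI core to (re)probe the device, which now matches vfio-pci
echo "$gpu" > /sys/bus/pci/drivers_probe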
When I try a PCI rescan, the kernel logs:
[ 2814.304147] pcieport 0000:48:01.0: bridge window [io 0x1000-0x0fff] to [bus 49] add_size 1000
[ 2814.304158] pcieport 0000:48:03.0: bridge window [io 0x1000-0x0fff] to [bus 4a] add_size 1000
[ 2814.304162] pcieport 0000:48:05.0: bridge window [io 0x1000-0x0fff] to [bus 4b] add_size 1000
[ 2814.304173] pcieport 0000:48:01.0: BAR 13: no space for [io size 0x1000]
[ 2814.304176] pcieport 0000:48:01.0: BAR 13: failed to assign [io size 0x1000]
[ 2814.304179] pcieport 0000:48:03.0: BAR 13: no space for [io size 0x1000]
[ 2814.304180] pcieport 0000:48:03.0: BAR 13: failed to assign [io size 0x1000]
[ 2814.304183] pcieport 0000:48:05.0: BAR 13: no space for [io size 0x1000]
[ 2814.304184] pcieport 0000:48:05.0: BAR 13: failed to assign [io size 0x1000]
[ 2814.304189] pcieport 0000:48:05.0: BAR 13: no space for [io size 0x1000]
[ 2814.304191] pcieport 0000:48:05.0: BAR 13: failed to assign [io size 0x1000]
[ 2814.304193] pcieport 0000:48:03.0: BAR 13: no space for [io size 0x1000]
[ 2814.304194] pcieport 0000:48:03.0: BAR 13: failed to assign [io size 0x1000]
[ 2814.304196] pcieport 0000:48:01.0: BAR 13: no space for [io size 0x1000]
[ 2814.304198] pcieport 0000:48:01.0: BAR 13: failed to assign [io size 0x1000]
But I don’t think those messages are related to this GPU: the bridge windows that fail to assign are on the 0000:48:0x.0 ports, while the GPU sits behind root port 0000:59:01.0.
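(For what it's worth, one can double-check that with standard tools, e.g.:

# show the bridge path above the GPU; 0000:48:0x.0 should not appear in it
readlink -f /sys/bus/pci/devices/0000:5a:00.0
# or view the whole PCI tree
lspci -tv
)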
I’ve rebooted a few times, but that does not restore the correct GPU configuration. I also tried an FLR; after that, PCI operations still hang.
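(By FLR I mean triggering a reset through the device's sysfs reset node, roughly as below; as I understand it, the kernel may actually pick whichever reset method the device advertises.)

gpu="0000:5a:00.0"
# request a function-level reset (or whatever reset method the kernel selects)
echo 1 > "/sys/bus/pci/devices/$gpu/reset"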
Therefore, I’m wondering: is there a way to restore/flush the GPU state back to normal?
Thanks in advance for any suggestions.