Querying the GPU device from the host fails constantly. The first error I observe is this:
nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings
NVIDIA GPU Tools version 535.86.06
2024-01-26,00:23:11.943 WARNING GPU 0000:21:00.0 ? 0x2331 BAR0 0x3d042000000 not in D0 (current state 3), forcing it to D0
Topo:
PCI 0000:20:01.1 0x1022:0x14ab
GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000
2024-01-26,00:23:11.995 INFO Selected GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000
2024-01-26,00:23:11.996 WARNING GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000 has CC mode on, some functionality may not work
Traceback (most recent call last):
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2499, in <module>
    main()
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2473, in main
    cc_settings = gpu.query_cc_settings()
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2284, in query_cc_settings
    knob_value = self.fsp_rpc.prc_knob_read(knob_id)
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 1915, in prc_knob_read
    data = self.prc_cmd([prc])
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 1883, in prc_cmd
    self.poll_for_msg_queue()
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 1845, in poll_for_msg_queue
    raise GpuError(f"Timed out polling for {self.npu.name} message queue on channel {self.channel_num}. head {mhead} == tail {mtail}")
__main__.GpuError: Timed out polling for fsp message queue on channel 2. head 4294967295 == tail 4294967295
nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$
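For context, the "not in D0 (current state 3)" warning means the GPU is reporting D3 in its PCI power-management register when the tool starts. Here is a minimal sketch (my own check, not part of gpu_cc_tool.py) that reads that state straight from config space via sysfs; the BDF is the one from the logs above, and it needs to run as root:

#!/usr/bin/env python3
# Minimal sketch: read the PCI power-management state of the GPU directly
# from config space, to confirm the "not in D0 (current state 3)" warning
# before running gpu_cc_tool.py. BDF taken from the logs above; run as root.

BDF = "0000:21:00.0"          # H100 from the topology output
CFG = f"/sys/bus/pci/devices/{BDF}/config"

def cfg_read(data, offset, size):
    # Little-endian read of `size` bytes at `offset` from raw config space.
    return int.from_bytes(data[offset:offset + size], "little")

with open(CFG, "rb") as f:
    cfg = f.read(256)         # standard header + capability list

vendor = cfg_read(cfg, 0x00, 2)
device = cfg_read(cfg, 0x02, 2)
print(f"{BDF}: vendor=0x{vendor:04x} device=0x{device:04x}")
if vendor == 0xffff:
    raise SystemExit("Device is not responding to config-space reads")

# Walk the capability list for the PCI Power Management capability (ID 0x01).
status = cfg_read(cfg, 0x06, 2)
if status & 0x10:             # capabilities list present
    ptr = cfg_read(cfg, 0x34, 1) & 0xfc
    while ptr:
        cap_id = cfg_read(cfg, ptr, 1)
        nxt = cfg_read(cfg, ptr + 1, 1) & 0xfc
        if cap_id == 0x01:
            pmcsr = cfg_read(cfg, ptr + 4, 2)
            dstate = pmcsr & 0x3          # 0 = D0, 3 = D3hot
            print(f"Power state: D{dstate}")
            break
        ptr = nxt

When the tool fails as above, this reports D3 for me at the same time, which matches the warning.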
After starting the guest confidential VM (CVM) and then shutting it down, I try unbinding and rebinding the device as indicated in the deployment guide (see the sysfs sketch after the output below). After that, I can query the GPU device:
nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings
NVIDIA GPU Tools version 535.86.06
Topo:
PCI 0000:20:01.1 0x1022:0x14ab
GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000
2024-01-26,00:27:56.041 INFO Selected GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000
2024-01-26,00:27:56.041 WARNING GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000 has CC mode on, some functionality may not work
2024-01-26,00:28:00.009 INFO GPU 0000:21:00.0 H100-PCIE 0x2331 BAR0 0x3d042000000 CC settings:
2024-01-26,00:28:00.009 INFO enable = 1
2024-01-26,00:28:00.009 INFO enable-devtools = 0
2024-01-26,00:28:00.009 INFO enable-allow-inband-control = 1
2024-01-26,00:28:00.009 INFO enable-devtools-allow-inband-control = 1
nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$
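For reference, the unbind/re-probe can be driven through sysfs roughly as below. This is only a sketch of what I run, not the deployment guide's exact commands; the BDF is the one from the logs, and the driver is simply whatever the device is currently bound to (typically vfio-pci in a passthrough setup). Run as root.

#!/usr/bin/env python3
# Rough sketch of the unbind/re-probe sequence (run as root).
import os
import time

BDF = "0000:21:00.0"
DEV = f"/sys/bus/pci/devices/{BDF}"

def sysfs_write(path, value):
    with open(path, "w") as f:
        f.write(value)

# Unbind the device from its current driver, if it has one.
driver_link = os.path.join(DEV, "driver")
if os.path.islink(driver_link):
    driver = os.path.basename(os.readlink(driver_link))
    print(f"Unbinding {BDF} from {driver}")
    sysfs_write(os.path.join(driver_link, "unbind"), BDF)
    time.sleep(1)

# Ask the PCI core to re-probe drivers for the device.
print(f"Re-probing {BDF}")
sysfs_write("/sys/bus/pci/drivers_probe", BDF)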
A few minutes after starting the guest CVM, I get this error on the host:
nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$ sudo python3 gpu_cc_tool.py --gpu-name=H100 --query-cc-settings
NVIDIA GPU Tools version 535.86.06
2024-01-26,00:31:00.759 WARNING GPU 0000:21:00.0 ? 0x2331 BAR0 0x3d042000000 not in D0 (current state 3), forcing it to D0
File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 127, in find_gpus_sysfs
dev = Gpu(dev_path=dev_path)
File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2092, in __init__
raise BrokenGpuError()
2024-01-26,00:31:00.877 ERROR GPU /sys/bus/pci/devices/0000:21:00.0 broken:
2024-01-26,00:31:00.901 ERROR Config space working True
Traceback (most recent call last):
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 127, in find_gpus_sysfs
    dev = Gpu(dev_path=dev_path)
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2092, in __init__
    raise BrokenGpuError()
__main__.BrokenGpuError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2499, in <module>
    main()
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 2436, in main
    gpus, other = find_gpus()
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 146, in find_gpus
    return find_gpus_sysfs(bdf)
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 137, in find_gpus_sysfs
    dev = BrokenGpu(dev_path=dev_path)
  File "/data/shared/nvtrust/host_tools/python/gpu_cc_tool.py", line 1269, in __init__
    self.bars_configured = self.sanity_check_cfg_space_bars()
AttributeError: 'BrokenGpu' object has no attribute 'sanity_check_cfg_space_bars'. Did you mean: 'sanity_check_cfg_space'?
nvidia@TRY-27360-gpu01:/data/shared/nvtrust/host_tools/python$
This is happening way too often and it is becoming hard to use the CVM.
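In the meantime I am experimenting with a small wrapper along these lines (a hypothetical script of my own, not part of nvtrust) that checks from sysfs whether the GPU has dropped out of D0 before gpu_cc_tool.py is run; the power_state attribute assumes a reasonably recent kernel, and the BDF/tool path match my setup above. Run as root from the host_tools/python directory.

#!/usr/bin/env python3
# Hypothetical pre-check wrapper: refuse to run the query if the GPU is not
# in D0, since that is the state in which the FSP queue poll times out for me.
import subprocess
import sys

BDF = "0000:21:00.0"
DEV = f"/sys/bus/pci/devices/{BDF}"
TOOL = ["python3", "gpu_cc_tool.py", "--gpu-name=H100", "--query-cc-settings"]

def power_state():
    try:
        with open(f"{DEV}/power_state") as f:
            return f.read().strip()        # e.g. "D0", "D3hot"
    except FileNotFoundError:
        return "unknown"                   # attribute missing on older kernels

state = power_state()
print(f"{BDF} power_state: {state}")
if state not in ("D0", "unknown"):
    print("GPU is not in D0; re-probe it (see the unbind/rebind sketch above) "
          "before querying, otherwise the query is likely to fail as shown.")
    sys.exit(1)

sys.exit(subprocess.run(TOOL).returncode)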