NVSwitch not working in PPCIe mode in a CVM

systemctl status nvidia-fabricmanager.service
× nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: enabled)
Active: failed (Result: exit-code) since Sun 2025-07-20 07:11:52 UTC; 25min ago
Duration: 4min 19.695s
Process: 5552 ExecStart=/usr/bin/nvidia-fabricmanager-start.sh $FM_CONFIG_FILE $FM_PID_FILE $NVLSM_CONFIG_FILE $NVLSM_PID_FILE (code=exited, status=1/FAILURE)
CPU: 37ms

Jul 20 07:11:52 cvm-oka47ws4 systemd[1]: Starting nvidia-fabricmanager.service - NVIDIA fabric manager service...
Jul 20 07:11:52 cvm-oka47ws4 nvidia-fabricmanager-start.sh[5552]: Detected Pre-NVL5 system
Jul 20 07:11:52 cvm-oka47ws4 nvidia-fabricmanager-start.sh[5559]: failed to acquire required privileges to access NVSwitch devices. make sure fabric manager has access permissions to required devic>
Jul 20 07:11:52 cvm-oka47ws4 nvidia-fabricmanager-start.sh[5552]: "/usr/bin/nv-fabricmanager" failed! Exit code: 1
Jul 20 07:11:52 cvm-oka47ws4 systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Jul 20 07:11:52 cvm-oka47ws4 systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Jul 20 07:11:52 cvm-oka47ws4 systemd[1]: Failed to start nvidia-fabricmanager.service - NVIDIA fabric manager service.

In /var/log/fabricmanager, we found the following error:
[Jul 20 2025 07:08:54] [INFO] [tid 3921] NVLink inband GPU probe request received on Switch NodeId 0 Switch Id 2 port 35 from Compute NodeId 0 GPU Id 6 port 16.
[Jul 20 2025 07:08:54] [INFO] [tid 3921] added GPU with UUID xxx based on NVLink Inband GPU probe request.
[Jul 20 2025 07:08:54] [INFO] [tid 3921] Compute NodeId: 0 GPU ID: 6 Index: 2048 Handle: 14f20494 PCI Bus ID: 00000000:08:00.0 Discovered Link Mask: 3ffff Enabled Link Mask: 3ffff UUID: xxx Target ID: 5

[Jul 20 2025 07:08:54] [ERROR] [tid 3921] All the GPUs probe requests are received, initiating protected PCIe NVSwitch lock state for NodeId 0.
[Jul 20 2025 07:08:54] [INFO] [tid 3897] Sending inband response message: Message header details: magic Id:adbc request Id:185579a63a59dad0 status:0 type:1 length:66
Message payload details: Probe response: Handle:3d5ae5c114f20494 GfId:0 FM Caps:27 Cluster Uuid:00000000-0000-0000-0000-000000000000 Fabric Partition Id:ffff GPA Address:8250000000000 GPA Address Range:8000000000 FLA Address:250000000000 FLA Address Range:8000000000 Fabric Clique ID:0 Fabric Health Mask:0

Please check whether PPCIe mode is enabled and whether the VBIOS version meets the requirement.

nvidia-smi -q | grep -i vbios
VBIOS Version : 96.00.CF.00.02
VBIOS Version : 96.00.CF.00.02
VBIOS Version : 96.00.CF.00.02
VBIOS Version : 96.00.CF.00.02
VBIOS Version : 96.00.CF.00.02
VBIOS Version : 96.00.CF.00.02
VBIOS Version : 96.00.CF.00.02
VBIOS Version : 96.00.CF.00.02
./nvidia_gpu_tools.py --query-ppcie-mode --gpu 0 | grep -i ppcie
Command line arguments: ['./nvidia_gpu_tools.py', '--query-ppcie-mode', '--gpu', '0']
2025-07-21,02:29:35.447 WARNING NvSwitch 0000:05:00.0 ? 0x22a3 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.452 WARNING NvSwitch 0000:06:00.0 ? 0x22a3 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.455 WARNING NvSwitch 0000:07:00.0 ? 0x22a3 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.458 WARNING NvSwitch 0000:08:00.0 ? 0x22a3 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.464 WARNING GPU 0000:19:00.0 ? 0x2335 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.471 WARNING GPU 0000:2a:00.0 ? 0x2335 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.476 WARNING GPU 0000:3b:00.0 ? 0x2335 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.480 WARNING GPU 0000:5d:00.0 ? 0x2335 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.485 WARNING GPU 0000:9b:00.0 ? 0x2335 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.491 WARNING GPU 0000:ab:00.0 ? 0x2335 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.495 WARNING GPU 0000:bb:00.0 ? 0x2335 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.501 WARNING GPU 0000:db:00.0 ? 0x2335 BAR0 0x0 was in D0/control:auto, forced power control to on. New state D0
2025-07-21,02:29:35.502 INFO Selected NvSwitch 0000:05:00.0 NVSwitch_gen3 0x22a3 BAR0 0x9a000000
2025-07-21,02:29:35.502 WARNING NvSwitch 0000:05:00.0 NVSwitch_gen3 0x22a3 BAR0 0x9a000000 has PPCIe mode on, some functionality may not work
2025-07-21,02:29:35.502 INFO NvSwitch 0000:05:00.0 NVSwitch_gen3 0x22a3 BAR0 0x9a000000 PPCIe mode is on
2025-07-21,02:29:35.503 WARNING GPU 0000:db:00.0 H200 0x2335 BAR0 0x43c042000000 restoring power control to auto
2025-07-21,02:29:35.503 WARNING GPU 0000:bb:00.0 H200 0x2335 BAR0 0x3bc042000000 restoring power control to auto
2025-07-21,02:29:35.503 WARNING GPU 0000:ab:00.0 H200 0x2335 BAR0 0x37c042000000 restoring power control to auto
2025-07-21,02:29:35.503 WARNING GPU 0000:9b:00.0 H200 0x2335 BAR0 0x33c042000000 restoring power control to auto
2025-07-21,02:29:35.503 WARNING GPU 0000:5d:00.0 H200 0x2335 BAR0 0x1bc042000000 restoring power control to auto
2025-07-21,02:29:35.503 WARNING GPU 0000:3b:00.0 H200 0x2335 BAR0 0x13c042000000 restoring power control to auto
2025-07-21,02:29:35.503 WARNING GPU 0000:2a:00.0 H200 0x2335 BAR0 0xfc042000000 restoring power control to auto
2025-07-21,02:29:35.504 WARNING GPU 0000:19:00.0 H200 0x2335 BAR0 0xbc042000000 restoring power control to auto
2025-07-21,02:29:35.504 WARNING NvSwitch 0000:08:00.0 NVSwitch_gen3 0x22a3 BAR0 0x94000000 restoring power control to auto
2025-07-21,02:29:35.504 WARNING NvSwitch 0000:07:00.0 NVSwitch_gen3 0x22a3 BAR0 0x96000000 restoring power control to auto
2025-07-21,02:29:35.504 WARNING NvSwitch 0000:06:00.0 NVSwitch_gen3 0x22a3 BAR0 0x98000000 restoring power control to auto
2025-07-21,02:29:35.504 WARNING NvSwitch 0000:05:00.0 NVSwitch_gen3 0x22a3 BAR0 0x9a000000 restoring power control to auto

2025-07-21,02:29:40.205 INFO Selected NvSwitch 0000:06:00.0 NVSwitch_gen3 0x22a3 BAR0 0x98000000
2025-07-21,02:29:40.206 WARNING NvSwitch 0000:06:00.0 NVSwitch_gen3 0x22a3 BAR0 0x98000000 has PPCIe mode on, some functionality may not work
2025-07-21,02:29:40.206 INFO NvSwitch 0000:06:00.0 NVSwitch_gen3 0x22a3 BAR0 0x98000000 PPCIe mode is on

2025-07-21,02:29:45.534 INFO Selected NvSwitch 0000:07:00.0 NVSwitch_gen3 0x22a3 BAR0 0x96000000
2025-07-21,02:29:45.535 WARNING NvSwitch 0000:07:00.0 NVSwitch_gen3 0x22a3 BAR0 0x96000000 has PPCIe mode on, some functionality may not work
2025-07-21,02:29:45.535 INFO NvSwitch 0000:07:00.0 NVSwitch_gen3 0x22a3 BAR0 0x96000000 PPCIe mode is on

2025-07-21,02:29:50.471 INFO Selected NvSwitch 0000:08:00.0 NVSwitch_gen3 0x22a3 BAR0 0x94000000
2025-07-21,02:29:50.472 WARNING NvSwitch 0000:08:00.0 NVSwitch_gen3 0x22a3 BAR0 0x94000000 has PPCIe mode on, some functionality may not work
2025-07-21,02:29:50.472 INFO NvSwitch 0000:08:00.0 NVSwitch_gen3 0x22a3 BAR0 0x94000000 PPCIe mode is on
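
The two checks in this post (VBIOS version per GPU and PPCIe mode per NVSwitch) can be summarized with a small script. The following is a minimal Python sketch that assumes the output formats shown in this post. The helper names vbios_versions and ppcie_status are hypothetical and are not part of nvidia-smi or nvidia_gpu_tools.py. Any wording other than "PPCIe mode is on" is treated as "not on", which is an assumption about how the tool reports the disabled state.

```python
import re
from collections import Counter

# Matches lines such as "VBIOS Version : 96.00.CF.00.02" from `nvidia-smi -q`.
VBIOS_RE = re.compile(r"VBIOS Version\s*:\s*(\S+)")

# Matches INFO lines such as
# "... INFO NvSwitch 0000:05:00.0 NVSwitch_gen3 0x22a3 BAR0 0x9a000000 PPCIe mode is on"
PPCIE_RE = re.compile(
    r"INFO\s+(NvSwitch|GPU)\s+"
    r"([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])"
    r".*PPCIe mode is (\S+)\s*$"
)


def vbios_versions(nvidia_smi_output: str) -> Counter:
    """Count how many GPUs report each VBIOS version."""
    return Counter(VBIOS_RE.findall(nvidia_smi_output))


def ppcie_status(tool_output: str) -> dict:
    """Map each PCI bus ID to True if its INFO line says PPCIe mode is on."""
    status = {}
    for line in tool_output.splitlines():
        match = PPCIE_RE.search(line)
        if match:
            # Assumption: anything other than "on" means PPCIe is not enabled.
            status[match.group(2)] = match.group(3) == "on"
    return status


if __name__ == "__main__":
    sample = (
        "2025-07-21,02:29:35.502 INFO NvSwitch 0000:05:00.0 NVSwitch_gen3 "
        "0x22a3 BAR0 0x9a000000 PPCIe mode is on"
    )
    print(ppcie_status(sample))
    print(vbios_versions("VBIOS Version : 96.00.CF.00.02"))
```

More than one key in the vbios_versions result would indicate mixed VBIOS versions across GPUs. Whether a given version meets the PPCIe requirement still has to be checked against NVIDIA's documentation.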

Thanks! Please check whether you are using the NVIDIA 570 or 575 open kernel driver in the guest VM.
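
One way to check this from inside the guest is to read /proc/driver/nvidia/version. The sketch below is a minimal Python example. It assumes that the file identifies the open flavor with the phrase "Open Kernel Module" on its NVRM line and that the first dotted number on that line is the driver version. Both are assumptions about the file format, and parse_nvrm_version is a hypothetical helper.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Assumed location of the driver version file inside the guest VM.
VERSION_FILE = Path("/proc/driver/nvidia/version")


def parse_nvrm_version(text: str) -> Tuple[bool, Optional[str]]:
    """Return (is_open_module, version) parsed from the NVRM version line."""
    for line in text.splitlines():
        if line.startswith("NVRM version:"):
            # Assumption: the open flavor mentions "Open Kernel Module".
            is_open = "Open Kernel Module" in line
            # Assumption: the first dotted number is the driver version.
            match = re.search(r"\b(\d+\.\d+(?:\.\d+)?)\b", line)
            return is_open, match.group(1) if match else None
    return False, None


if __name__ == "__main__":
    if VERSION_FILE.exists():
        is_open, version = parse_nvrm_version(VERSION_FILE.read_text())
        branch = version.split(".")[0] if version else "unknown"
        print(f"open kernel module: {is_open}, driver branch: {branch}")
    else:
        print(f"{VERSION_FILE} not found; is the NVIDIA driver loaded?")
```

A driver branch of 570 or 575 together with an open kernel module result of True would match the configuration asked about here.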


So nvidia-smi runs without problems in the guest? Did you run nvidia-persistenced --uvm-persistence-mode?

nvidia-smi works, and persistence mode is on.

Have you tried running CUDA programs on a single GPU and on multiple GPUs (e.g., a P2P memory copy)?
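
A quick way to run such a test is a small PyTorch script. The following is a minimal sketch, assuming PyTorch with CUDA support is installed in the guest. It is not an NVIDIA-provided test, and the helper names peer_access_matrix and run_p2p_check are hypothetical. It reports whether each GPU pair advertises peer access and whether a tensor copied between the two GPUs arrives intact.

```python
from itertools import permutations
from typing import Callable, Dict, Tuple


def peer_access_matrix(
    device_count: int, can_access: Callable[[int, int], bool]
) -> Dict[Tuple[int, int], bool]:
    """Query peer access for every ordered pair of distinct devices."""
    return {
        (src, dst): bool(can_access(src, dst))
        for src, dst in permutations(range(device_count), 2)
    }


def run_p2p_check() -> None:
    """Copy a tensor between every GPU pair and verify the contents."""
    try:
        import torch  # imported lazily so the helper above works without it
    except ImportError:
        print("PyTorch is not installed; skipping the P2P check.")
        return
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        print("Fewer than two CUDA devices visible; skipping the P2P check.")
        return

    count = torch.cuda.device_count()
    matrix = peer_access_matrix(count, torch.cuda.can_device_access_peer)
    for (src, dst), ok in matrix.items():
        data = torch.arange(1 << 20, dtype=torch.float32, device=f"cuda:{src}")
        copied = data.to(f"cuda:{dst}")  # device-to-device copy
        torch.cuda.synchronize(dst)
        intact = torch.equal(copied.cpu(), data.cpu())
        print(f"GPU {src} -> GPU {dst}: peer access={ok}, copy intact={intact}")


if __name__ == "__main__":
    run_p2p_check()
```

An intact copy alone does not prove that the transfer used NVLink, because the copy can be staged through host memory when peer access is unavailable. The peer access column is the more relevant signal for this issue.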