Timeout waiting for RPC from GSP!

I hit this problem with the official driver 525.85.12. Many others have reported it in the GitHub issue (https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446), even with the latest version. Could the NVIDIA team please look into it? Is there a solution? Thanks!

Fri Mar  3 14:41:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:01:00.0 Off |                    0 |
| N/A   26C    P0    29W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A30          Off  | 00000000:22:00.0 Off |                    0 |
| N/A   40C    P0    92W / 165W |  16096MiB / 24576MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A30          Off  | 00000000:41:00.0 Off |                    0 |
| N/A   42C    P0   141W / 165W |  14992MiB / 24576MiB |     33%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  ERR!                Off  | 00000000:61:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |  23025MiB / 24576MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A30          Off  | 00000000:81:00.0 Off |                    0 |
| N/A   26C    P0    26W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A30          Off  | 00000000:A1:00.0 Off |                    0 |
| N/A   26C    P0    28W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A30          Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   25C    P0    29W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A30          Off  | 00000000:E1:00.0 Off |                    0 |
| N/A   24C    P0    25W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
[Fri Mar  3 04:23:22 2023] NVRM: GPU at PCI:0000:61:00: GPU-e59ce3f9-af53-a0dd-1d2c-8beaa74aa635
[Fri Mar  3 04:23:22 2023] NVRM: GPU Board Serial Number: 1322621149782
[Fri Mar  3 04:23:22 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar  3 04:23:22 2023] CPU: 72 PID: 1344368 Comm: nvidia-smi Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar  3 04:23:22 2023] Hardware name: Inspur NF5468A5/YZMB-02382-101, BIOS 4.02.12 01/28/2022
[Fri Mar  3 04:23:22 2023] Call Trace:
[Fri Mar  3 04:23:22 2023]  dump_stack+0x6b/0x83
[Fri Mar  3 04:23:22 2023]  _nv011231rm+0x39d/0x470 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv011168rm+0x62/0x2e0 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv040022rm+0xdb/0x140 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv041148rm+0x2ce/0x3a0 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv015451rm+0x788/0x800 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv039541rm+0xac/0xe0 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv041150rm+0xac/0x140 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv041149rm+0x37a/0x4d0 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv039443rm+0xc9/0x150 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv039444rm+0x42/0x70 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv000554rm+0x49/0x60 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv000694rm+0x7fb/0xc80 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar  3 04:23:22 2023]  ? do_syscall_64+0x33/0x80
[Fri Mar  3 04:23:22 2023]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar  3 04:24:07 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar  3 04:24:52 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar  3 04:25:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar  3 04:26:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar  3 04:27:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar  3 04:27:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar  3 04:28:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar  3 04:29:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:30:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar  3 04:30:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:31:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:32:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:33:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:33:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar  3 04:34:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar  3 04:35:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar  3 04:36:03 2023] INFO: task nvidia-smi:1346229 blocked for more than 120 seconds.
[Fri Mar  3 04:36:03 2023]       Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar  3 04:36:03 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar  3 04:36:03 2023] task:nvidia-smi      state:D stack:    0 pid:1346229 ppid:1346228 flags:0x00000000
[Fri Mar  3 04:36:03 2023] Call Trace:
[Fri Mar  3 04:36:03 2023]  __schedule+0x282/0x880
[Fri Mar  3 04:36:03 2023]  ? rwsem_spin_on_owner+0x74/0xd0
[Fri Mar  3 04:36:03 2023]  schedule+0x46/0xb0
[Fri Mar  3 04:36:03 2023]  rwsem_down_write_slowpath+0x246/0x4d0
[Fri Mar  3 04:36:03 2023]  os_acquire_rwlock_write+0x31/0x40 [nvidia]
[Fri Mar  3 04:36:03 2023]  _nv038505rm+0xc/0x30 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv039453rm+0x18d/0x1d0 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv041182rm+0x45/0xd0 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv041127rm+0x142/0x2b0 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv039415rm+0x15a/0x2e0 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv039416rm+0x5b/0x90 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv039416rm+0x31/0x90 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv000559rm+0x5a/0x70 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv000559rm+0x33/0x70 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv000694rm+0x94a/0xc80 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar  3 04:36:03 2023]  ? do_syscall_64+0x33/0x80
[Fri Mar  3 04:36:03 2023]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar  3 04:36:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar  3 04:36:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar  3 04:37:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar  3 04:38:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:39:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:39:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:40:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar  3 04:41:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar  3 04:42:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar  3 04:42:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar  3 04:43:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar  3 04:44:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar  3 04:45:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar  3 04:45:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar  3 04:46:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar  3 04:47:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar  3 04:48:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:48:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar  3 04:49:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:50:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:51:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:51:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:52:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar  3 04:53:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar  3 04:54:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar  3 04:54:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar  3 04:55:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar  3 04:56:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar  3 04:57:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:57:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:58:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:59:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar  3 05:00:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar  3 05:00:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar  3 05:01:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar  3 05:02:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar  3 05:03:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar  3 05:03:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar  3 05:04:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar  3 05:05:26 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar  3 05:06:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar  3 05:06:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 05:07:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
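
For anyone triaging a log like the one above, here is a rough Python sketch that tallies the Xid 119 lines per PCI bus and expected RPC function. The regular expression and the function name `summarize_xid119` are assumptions inferred only from the lines quoted in this post, not from any NVIDIA format specification, so other driver versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern inferred from the dmesg lines quoted in this thread (an assumption,
# not an official description of the NVRM log format).
XID119 = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): 119, "
    r"pid=(?P<pid>\d+), name=(?P<name>[^,]+), "
    r"Timeout waiting for RPC from GSP! "
    r"Expected function (?P<fn>\d+) \((?P<fn_name>\w+)\)"
)


def summarize_xid119(lines):
    """Count Xid 119 GSP timeouts per (PCI bus, RPC function name)."""
    counts = Counter()
    for line in lines:
        match = XID119.search(line)
        if match:
            counts[(match.group("bus"), match.group("fn_name"))] += 1
    return counts
```

Feeding it the saved kernel log, for example `summarize_xid119(open("dmesg.txt"))` with a hypothetical file name, shows at a glance whether every timeout comes from the same GPU and which RPC calls are stuck.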

Did you install and enable the nvidia-persistenced daemon?

I tried enabling persistence mode with nvidia-smi -pm 1, but the error still comes back. I don't think this is related to persistence mode: I previously ran driver 470 and earlier without persistence mode enabled and never saw this problem.

We are seeing the same issue with A100s, see below. Is there a fix or workaround published by NVIDIA?

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  ERR!                On   | 00000000:07:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |      0MiB / 81920MiB |    ERR!   E. Process |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM…    On   | 00000000:0B:00.0 Off |                    0 |
| N/A   39C    P0    65W / 400W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM…    On   | 00000000:48:00.0 Off |                    0 |
| N/A   34C    P0    62W / 400W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM…    On   | 00000000:4C:00.0 Off |                    0 |
| N/A   38C    P0    67W / 400W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM…    On   | 00000000:88:00.0 Off |                    0 |
| N/A   34C    P0    62W / 400W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM…    On   | 00000000:8B:00.0 Off |                    0 |
| N/A   38C    P0    62W / 400W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM…    On   | 00000000:C8:00.0 Off |                    0 |
| N/A   35C    P0    62W / 400W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM…    On   | 00000000:CB:00.0 Off |                    0 |
| N/A   35C    P0    62W / 400W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

The new Linux driver 525.105.17 includes a fix from NVIDIA for this bug. From the release notes:
"Fixed an issue specific to GSP-RM that could lead to GSP RPC timeout errors (Xid 119). The issue was introduced in the first 525 driver release and was not present in earlier drivers."

Driver 525.105.17 does not seem to fix the issue for A40 GPUs. I updated two systems with 8 A40s each from the latest 515 driver to 525.105.17, and both show the error (Xid 119) on the first GPU, just like edw reported above. Reverting to 515 solves the issue. Switching the GSP firmware off also works around the problem with the 525 driver.
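
In case it helps others try the same workaround, here is a minimal Python sketch under a few assumptions. As far as I know, the NVIDIA driver README describes the kernel module parameter `NVreg_EnableGpuFirmware` for turning GSP firmware off, and `nvidia-smi -q` prints a "GSP Firmware Version" field that reads N/A when it is off. Whether this is exactly how the firmware was switched off on the systems above is not stated in the thread, and both helper names below are made up for illustration.

```python
import re


def gsp_firmware_versions(smi_query_text):
    """Return every 'GSP Firmware Version' value found in `nvidia-smi -q` text."""
    return re.findall(
        r"^\s*GSP Firmware Version\s*:\s*(.+?)\s*$",
        smi_query_text,
        flags=re.MULTILINE,
    )


def modprobe_line(enable):
    """Build the modprobe.d option line that toggles GSP firmware.

    Assumption: NVreg_EnableGpuFirmware=0 disables GSP firmware, as described
    in the NVIDIA driver README; verify against the README for your version.
    """
    return f"options nvidia NVreg_EnableGpuFirmware={1 if enable else 0}"
```

The line from `modprobe_line(False)` would typically go into a file under /etc/modprobe.d/ (the exact file name is up to you), followed by a driver reload or reboot. Afterwards, passing the captured `nvidia-smi -q` output to `gsp_firmware_versions` should return N/A for each GPU if the firmware is really off.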

I have been seeing this issue constantly on long-running GPU computations. I have tried everything, and certainly being up to date is not enough to circumvent this.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe           On | 00000000:08:00.0 Off |                    0 |
| N/A   28C    P0               43W / 300W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe           On | 00000000:48:00.0 Off |                    0 |
| N/A   27C    P0               42W / 300W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe           On | 00000000:88:00.0 Off |                    0 |
| N/A   27C    P0               42W / 300W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe           On | 00000000:C8:00.0 Off |                    0 |
| N/A   27C    P0               42W / 300W|      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

So I wake up to ERR when I expected my process to have finished successfully. I'm not sure whether this is a hardware or a software issue, but I am willing to try anything before calling support.

Version 530.30.02 doesn't support the A100; please use 525.105.17. We ran into this issue as well. I assume you are using the NVIDIA repos, in which case you need to choose the correct NVIDIA module.