I encountered this problem with the official driver 525.85.12. Many people also encountered it in the github issue(https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446), even in the latest version. Please NVIDIA colleagues pay more attention to it. Is there a solution? Thanks!
Fri Mar 3 14:41:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A30 Off | 00000000:01:00.0 Off | 0 |
| N/A 26C P0 29W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 Off | 00000000:22:00.0 Off | 0 |
| N/A 40C P0 92W / 165W | 16096MiB / 24576MiB | 88% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A30 Off | 00000000:41:00.0 Off | 0 |
| N/A 42C P0 141W / 165W | 14992MiB / 24576MiB | 33% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 ERR! Off | 00000000:61:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 23025MiB / 24576MiB | ERR! Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A30 Off | 00000000:81:00.0 Off | 0 |
| N/A 26C P0 26W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A30 Off | 00000000:A1:00.0 Off | 0 |
| N/A 26C P0 28W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A30 Off | 00000000:C1:00.0 Off | 0 |
| N/A 25C P0 29W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A30 Off | 00000000:E1:00.0 Off | 0 |
| N/A 24C P0 25W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
[Fri Mar 3 04:23:22 2023] NVRM: GPU at PCI:0000:61:00: GPU-e59ce3f9-af53-a0dd-1d2c-8beaa74aa635
[Fri Mar 3 04:23:22 2023] NVRM: GPU Board Serial Number: 1322621149782
[Fri Mar 3 04:23:22 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 04:23:22 2023] CPU: 72 PID: 1344368 Comm: nvidia-smi Tainted: P OE 5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar 3 04:23:22 2023] Hardware name: Inspur NF5468A5/YZMB-02382-101, BIOS 4.02.12 01/28/2022
[Fri Mar 3 04:23:22 2023] Call Trace:
[Fri Mar 3 04:23:22 2023] dump_stack+0x6b/0x83
[Fri Mar 3 04:23:22 2023] _nv011231rm+0x39d/0x470 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv011168rm+0x62/0x2e0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv040022rm+0xdb/0x140 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041148rm+0x2ce/0x3a0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv015451rm+0x788/0x800 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039541rm+0xac/0xe0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041150rm+0xac/0x140 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041149rm+0x37a/0x4d0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039443rm+0xc9/0x150 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039444rm+0x42/0x70 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv000554rm+0x49/0x60 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv000694rm+0x7fb/0xc80 [nvidia]
[Fri Mar 3 04:23:22 2023] ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar 3 04:23:22 2023] ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar 3 04:23:22 2023] ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar 3 04:23:22 2023] ? do_syscall_64+0x33/0x80
[Fri Mar 3 04:23:22 2023] ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar 3 04:24:07 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 04:24:52 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 04:25:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 04:26:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 04:27:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 04:27:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 04:28:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 04:29:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:30:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar 3 04:30:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:31:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:32:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:33:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:33:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar 3 04:34:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar 3 04:35:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar 3 04:36:03 2023] INFO: task nvidia-smi:1346229 blocked for more than 120 seconds.
[Fri Mar 3 04:36:03 2023] Tainted: P OE 5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar 3 04:36:03 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 3 04:36:03 2023] task:nvidia-smi state:D stack: 0 pid:1346229 ppid:1346228 flags:0x00000000
[Fri Mar 3 04:36:03 2023] Call Trace:
[Fri Mar 3 04:36:03 2023] __schedule+0x282/0x880
[Fri Mar 3 04:36:03 2023] ? rwsem_spin_on_owner+0x74/0xd0
[Fri Mar 3 04:36:03 2023] schedule+0x46/0xb0
[Fri Mar 3 04:36:03 2023] rwsem_down_write_slowpath+0x246/0x4d0
[Fri Mar 3 04:36:03 2023] os_acquire_rwlock_write+0x31/0x40 [nvidia]
[Fri Mar 3 04:36:03 2023] _nv038505rm+0xc/0x30 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039453rm+0x18d/0x1d0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv041182rm+0x45/0xd0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv041127rm+0x142/0x2b0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039415rm+0x15a/0x2e0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039416rm+0x5b/0x90 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039416rm+0x31/0x90 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000559rm+0x5a/0x70 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000559rm+0x33/0x70 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000694rm+0x94a/0xc80 [nvidia]
[Fri Mar 3 04:36:03 2023] ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar 3 04:36:03 2023] ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar 3 04:36:03 2023] ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar 3 04:36:03 2023] ? do_syscall_64+0x33/0x80
[Fri Mar 3 04:36:03 2023] ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar 3 04:36:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar 3 04:36:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar 3 04:37:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar 3 04:38:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:39:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:39:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:40:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar 3 04:41:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar 3 04:42:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 04:42:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 04:43:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 04:44:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 04:45:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 04:45:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 04:46:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 04:47:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 04:48:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:48:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar 3 04:49:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:50:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:51:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:51:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:52:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar 3 04:53:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar 3 04:54:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar 3 04:54:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar 3 04:55:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar 3 04:56:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar 3 04:57:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:57:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:58:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:59:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar 3 05:00:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar 3 05:00:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 05:01:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 05:02:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 05:03:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 05:03:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 05:04:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 05:05:26 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 05:06:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 05:06:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 05:07:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).