A100 rwlock error: is it a software error?

sudo dmesg | grep -i nvidia

[5868833.531848] os_dump_stack+0xe/0x10 [nvidia]
[5868833.532027] _nv011486rm+0x3ff/0x480 [nvidia]

[5868922.248466] os_acquire_rwlock_write+0x35/0x40 [nvidia]

sudo tail -f /var/log/syslog

gather_node_measure(): LibraryFunctionError('NVML', 'nvmlDeviceGetMigMode', 999)
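For reference, a minimal sketch of the same NVML query, assuming the monitoring code uses an NVML binding such as pynvml (the function and device index are taken from the log line above; the surrounding code is illustrative only). Error code 999 is NVML_ERROR_UNKNOWN, i.e. the driver failed to answer the request:

# Sketch only: query MIG mode on GPU 0 the way nvmlDeviceGetMigMode does,
# assuming a pynvml-style binding; not the actual monitoring code.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, the faulty device
    current, pending = pynvml.nvmlDeviceGetMigMode(handle)
    print(f"MIG mode: current={current}, pending={pending}")
except pynvml.NVMLError as err:
    # On the failing GPU this surfaced as error 999 (NVML_ERROR_UNKNOWN)
    print(f"nvmlDeviceGetMigMode failed: {err} (code {err.value})")
finally:
    pynvml.nvmlShutdown()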

During GPU training, an error occurred on GPU 0 of the four A100 GPUs, as shown in the nvidia-smi output. I extracted the corresponding logs above. While trying to terminate all running programs on GPU 0, the "kill" command had no effect, so I used "fuser -k" to forcibly terminate the processes. After approximately three hours, the GPU returned to normal.
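For context, "fuser -k /dev/nvidia0" kills every process that holds the GPU 0 device node open. A rough Python sketch of the lookup side only (listing which PIDs hold the device, without killing anything); the path /dev/nvidia0 and the GPU-0-to-device-node mapping are assumptions:

# Sketch: enumerate PIDs with /dev/nvidia0 open, roughly what
# `fuser -v /dev/nvidia0` reports (needs root to see other users' processes).
import glob
import os

def pids_using(dev_path="/dev/nvidia0"):  # assumption: GPU 0 maps to /dev/nvidia0
    pids = set()
    for fd_link in glob.glob("/proc/[0-9]*/fd/*"):
        try:
            if os.readlink(fd_link) == dev_path:
                pids.add(int(fd_link.split("/")[2]))
        except OSError:
            continue  # process exited or fd not readable
    return sorted(pids)

if __name__ == "__main__":
    print(pids_using())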

Is this issue related to the software that manages GPU virtualization (e.g., MIG), or is it a genuine hardware error?