sudo dmesg | grep -i nvidia
…
[5868833.531848] os_dump_stack+0xe/0x10 [nvidia]
[5868833.532027] _nv011486rm+0x3ff/0x480 [nvidia]
…
[5868922.248466] os_acquire_rwlock_write+0x35/0x40 [nvidia]
sudo tail -f /var/log/syslog
…
gather_node_measure(): LibraryFunctionError('NVML', 'nvmlDeviceGetMigMode', 999)
…
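For reference, the MIG state that the failing nvmlDeviceGetMigMode call tries to read can also be queried through nvidia-smi once the card settles (this assumes index 0 here matches nvidia-smi's ordering; nvidia-smi --help-query-gpu lists the exact field names your driver supports):

# re-check the MIG mode that NVML reported error 999 on in syslog
nvidia-smi -i 0 --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv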
During GPU training, one of the four A100 GPUs (GPU 0) reported an error in the nvidia-smi output. I extracted the corresponding logs above. When I tried to terminate all of the programs running on GPU 0, the "kill" command had no effect, so I force-killed them with "fuser -k". The GPU only returned to normal after roughly three hours.
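For context, something like the following shows which PIDs are holding GPU 0 before force-killing them; note that the /dev/nvidia0 device-node number does not always match nvidia-smi's GPU index, so it is worth double-checking which node belongs to the faulty card:

nvidia-smi -i 0 --query-compute-apps=pid,process_name --format=csv   # processes with compute contexts on GPU 0
sudo fuser -v /dev/nvidia0   # every process that has the device node open
sudo fuser -k /dev/nvidia0   # force-kill them; fuser -k sends SIGKILL by default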
Is this issue related to the software managing GPU virtualization?