Hi all,
We have three servers on our HPC cluster, each with 4 V100s. They had been running largely without problems for the last 6 months. About a month ago, GPU jobs started hanging, usually on one GPU in a node, until eventually all GPUs on that node become unresponsive. By now the issue has affected all three of our V100 nodes.
We mainly run Gromacs, Amber and TensorFlow jobs.
Gromacs output shows:

WARNING: An error occurred while sanity checking device #0; cudaErrorDevicesUnavailable: all CUDA-capable devices are busy or unavailable

even though there are idle GPUs with no processes running on them.
TensorFlow fails with cudaErrorUnknown, and Amber simply hangs.
nvidia-smi output looks normal. The system logs show varying errors, and sometimes none at all. A typical example:
kernel: NVRM: Xid (PCI:0000:1a:00): 31, Ch 0000000b, engmask 00080100, intr 00000000
kernel: INFO: task pmemd.cuda:426317 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: pmemd.cuda D ffff939d380c2080 0 426317 426315 0x00000080
kernel: Call Trace:
kernel: [<ffffffff97969d39>] schedule_preempt_disabled+0x29/0x70
kernel: [<ffffffff97967cb7>] __mutex_lock_slowpath+0xc7/0x1d0
kernel: [<ffffffff9796709f>] mutex_lock+0x1f/0x2f
kernel: [<ffffffffc542ded9>] uvm_gpu_release+0x19/0x30 [nvidia_uvm]
kernel: [<ffffffffc546a8eb>] uvm_ext_gpu_map_free+0x1b/0x20 [nvidia_uvm]
kernel: [<ffffffffc5432951>] uvm_deferred_free_object_list+0x61/0x110 [nvidia_uvm]
kernel: [<ffffffffc546b231>] uvm_api_unmap_external_allocation+0x141/0x160 [nvidia_uvm]
kernel: [<ffffffffc5425e9a>] uvm_unlocked_ioctl+0xdfa/0x11b0 [nvidia_uvm]
kernel: [<ffffffff973e933e>] ? do_numa_page+0x1be/0x250
kernel: [<ffffffff973e96e6>] ? handle_pte_fault+0x316/0xd10
kernel: [<ffffffff9743e7aa>] ? __check_object_size+0x1ca/0x250
kernel: [<ffffffff973ec1fd>] ? handle_mm_fault+0x39d/0x9b0
kernel: [<ffffffff97456950>] do_vfs_ioctl+0x3a0/0x5a0
kernel: [<ffffffff97970628>] ? __do_page_fault+0x228/0x4f0
kernel: [<ffffffff97456bf1>] SyS_ioctl+0xa1/0xc0
kernel: [<ffffffff97975ddb>] system_call_fastpath+0x22/0x27
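The Xid 31 in the log above is, as far as I know, reported by NVIDIA as a GPU memory page fault, which can be an application bug but can also point at driver or hardware trouble when it recurs across nodes. Since the errors only show up intermittently, we have been grepping the logs for Xid events. A minimal sketch of that (not from our production setup; the helper name and regex are my own) that pulls the PCI bus ID and Xid code out of dmesg-style lines:

```python
import re

# Hypothetical helper: scan dmesg-style lines for NVRM Xid events so a
# periodic job can flag a node before user jobs start hanging on it.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")

def find_xid_events(log_lines):
    """Return a list of (pci_bus_id, xid_code) tuples found in the log."""
    events = []
    for line in log_lines:
        m = XID_RE.search(line)
        if m:
            events.append((m.group(1), int(m.group(2))))
    return events

sample = [
    "kernel: NVRM: Xid (PCI:0000:1a:00): 31, Ch 0000000b, "
    "engmask 00080100, intr 00000000",
]
print(find_xid_events(sample))  # [('PCI:0000:1a:00', 31)]
```

Feeding it the output of `dmesg` (or journalctl -k) and alerting on any non-empty result at least tells us which GPU faulted first, before the whole node locks up.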
Rebooting clears the problem for only a few days. We have updated the NVIDIA driver and the kernel, but the problem continues.
System:
CentOS 7.6
kernel 3.10.0-957.12.2.el7.x86_64
Driver Version: 418.67
I’m attaching nvidia-bug-report to this post.
nvidia-bug-report.log.gz (2.37 MB)