V100 GPUs hang randomly

Hi all,

We have three servers with four V100s each on our HPC cluster. They had been running more or less without problems for the last six months. A month ago, GPU jobs started hanging, usually on one GPU in a node, and eventually all GPUs on that node become unresponsive. By now the issue has affected all three of our V100 nodes.

We mainly run Gromacs, Amber and TensorFlow jobs.
Gromacs output shows:

WARNING: An error occurred while sanity checking device #0; cudaErrorDevicesUnavailable: all CUDA-capable devices are busy or unavailable

This happens even though there are idle GPUs with no processes running on them.
TensorFlow fails with cudaErrorUnknown, and Amber simply hangs.
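One thing worth ruling out when "all CUDA-capable devices are busy or unavailable" appears on a seemingly idle GPU: a device left in Exclusive_Process compute mode while a stale client still holds it will refuse new contexts. A minimal sketch for checking this, assuming `nvidia-smi` is on the PATH (the query field names come from `nvidia-smi --help-query-gpu`; the sample text below is hypothetical captured output, shown so the parsing is visible):

```python
# Sketch: list each GPU's compute mode via nvidia-smi's CSV query interface.
# Assumes nvidia-smi is on PATH; field names per `nvidia-smi --help-query-gpu`.
import csv
import io
import subprocess

def gpu_compute_modes(smi_output=None):
    """Return [(index, name, compute_mode), ...] from nvidia-smi CSV output.

    If smi_output is None, nvidia-smi is invoked; pass captured text to
    inspect saved output instead.
    """
    if smi_output is None:
        smi_output = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,compute_mode",
             "--format=csv,noheader"],
            check=True, capture_output=True, text=True,
        ).stdout
    rows = csv.reader(io.StringIO(smi_output), skipinitialspace=True)
    return [(int(idx), name, mode) for idx, name, mode in rows]

# Hypothetical captured output, for illustration only:
sample = ("0, Tesla V100-SXM2-16GB, Default\n"
          "1, Tesla V100-SXM2-16GB, Exclusive_Process\n")
for idx, name, mode in gpu_compute_modes(sample):
    print(idx, name, mode)
```

If a GPU reports Exclusive_Process and nothing should be using it, a leftover client process is a likely suspect.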

nvidia-smi output looks normal. System logs show varying errors; sometimes there are none at all. A typical example:

kernel: NVRM: Xid (PCI:0000:1a:00): 31, Ch 0000000b, engmask 00080100, intr 00000000
kernel: INFO: task pmemd.cuda:426317 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: pmemd.cuda      D ffff939d380c2080     0 426317 426315 0x00000080
kernel: Call Trace:
kernel: [<ffffffff97969d39>] schedule_preempt_disabled+0x29/0x70
kernel: [<ffffffff97967cb7>] __mutex_lock_slowpath+0xc7/0x1d0
kernel: [<ffffffff9796709f>] mutex_lock+0x1f/0x2f
kernel: [<ffffffffc542ded9>] uvm_gpu_release+0x19/0x30 [nvidia_uvm]
kernel: [<ffffffffc546a8eb>] uvm_ext_gpu_map_free+0x1b/0x20 [nvidia_uvm]
kernel: [<ffffffffc5432951>] uvm_deferred_free_object_list+0x61/0x110 [nvidia_uvm]
kernel: [<ffffffffc546b231>] uvm_api_unmap_external_allocation+0x141/0x160 [nvidia_uvm]
kernel: [<ffffffffc5425e9a>] uvm_unlocked_ioctl+0xdfa/0x11b0 [nvidia_uvm]
kernel: [<ffffffff973e933e>] ? do_numa_page+0x1be/0x250
kernel: [<ffffffff973e96e6>] ? handle_pte_fault+0x316/0xd10
kernel: [<ffffffff9743e7aa>] ? __check_object_size+0x1ca/0x250
kernel: [<ffffffff973ec1fd>] ? handle_mm_fault+0x39d/0x9b0
kernel: [<ffffffff97456950>] do_vfs_ioctl+0x3a0/0x5a0
kernel: [<ffffffff97970628>] ? __do_page_fault+0x228/0x4f0
kernel: [<ffffffff97456bf1>] SyS_ioctl+0xa1/0xc0
kernel: [<ffffffff97975ddb>] system_call_fastpath+0x22/0x27
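To track how often these events occur and on which devices, the NVRM lines can be pulled out of dmesg or /var/log/messages and correlated with job start times. A minimal sketch based on the Xid line format shown above:

```python
# Sketch: extract NVRM Xid events from kernel log text (e.g. `dmesg`
# output), returning the PCI bus ID and Xid code for each event.
import re

XID_RE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:]+)\): (?P<code>\d+)")

def extract_xids(log_text):
    """Return a list of (pci_bus_id, xid_code) tuples found in log_text."""
    return [(m.group("bus"), int(m.group("code")))
            for m in XID_RE.finditer(log_text)]

sample = ("kernel: NVRM: Xid (PCI:0000:1a:00): 31, Ch 0000000b, "
          "engmask 00080100, intr 00000000\n")
print(extract_xids(sample))  # [('0000:1a:00', 31)]
```

Running this over logs from all three nodes would show whether the hangs always start with the same Xid code or on the same PCI slot.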

Rebooting clears the problem, but only for a few days. We have updated the NVIDIA driver and the kernel, but the problem persists.
System:
CentOS 7.6
kernel 3.10.0-957.12.2.el7.x86_64
Driver Version: 418.67

I’m attaching nvidia-bug-report to this post.
nvidia-bug-report.log.gz (2.37 MB)

The 418.67 driver hasn't been available for six months, so you must have updated drivers somewhere along the way. I have no particular reason to suspect a problem with 418.67, but you may wish to try different drivers (older ones; there isn't a newer Tesla driver at the moment) to see whether the behavior changes.

Xid 31 is often generated as a result of an application fault, such as an illegal memory access. That isn't guaranteed to be the cause here, however:

https://docs.nvidia.com/deploy/xid-errors/index.html#topic_5_2

In any case, if a GPU hits an Xid 31 error and the host process that owns that GPU does not terminate, the GPU may remain unresponsive to further work. Rebooting usually does clear this particular state.
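Since nvidia-smi can show no compute processes while something still holds the device node open, one way to find a stuck owner without rebooting is to check which PIDs have /dev/nvidia* file descriptors open, which is essentially what `fuser -v /dev/nvidia*` reports. A stdlib-only sketch (it needs enough privilege to read other users' /proc entries, so run it as root on the node):

```python
# Sketch: find PIDs that still hold /dev/nvidia* file descriptors open,
# roughly what `fuser -v /dev/nvidia*` reports. Processes we cannot
# inspect (permissions, or they exited mid-scan) are silently skipped.
import os

def nvidia_device_holders(proc="/proc"):
    """Return {pid: [device paths]} for processes with /dev/nvidia* open."""
    holders = {}
    for pid in (p for p in os.listdir(proc) if p.isdigit()):
        fd_dir = os.path.join(proc, pid, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:  # no permission, or process already exited
            continue
        devs = []
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if target.startswith("/dev/nvidia"):
                devs.append(target)
        if devs:
            holders[int(pid)] = sorted(set(devs))
    return holders

print(nvidia_device_holders())
```

If a PID shows up here after its job has supposedly finished, killing that process (or, failing that, the reboot) is what releases the GPU.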

If the GPUs were purchased as part of a properly configured OEM system, you may wish to contact the OEM and request support.