I am seeing system hangs while running some image analysis code on NVIDIA GPUs. The most recent driver I know of that does not cause this problem is 295.20, although I have not tried every driver between that one and the current releases.
As a minimal test case, I wrote a CUDA program which selects a GPU according to a command line option, allocates a 4096x2048 complex array on the device, plans a 2D FFT using cuFFT, and then runs a loop many times, each iteration setting the array to zero and executing the planned FFT. Every 1000 iterations, the current iteration number is printed to stdout. The actual analysis code is more complicated but should not be doing anything untoward. Besides performing lots of 2D FFTs, it calls __syncthreads() from a reduce kernel, makes some atomicAdd() calls from a different kernel, and does lots of memcpy() from host to device and from device to host. Regardless, the FFT-only test case caused hangs on a time scale similar to that of the analysis code I would like to run (24-48 hours).
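For readers without the attachment, the loop described above can be sketched roughly as follows. This is a reconstruction from the description, not the attached file: the file name, the error-checking macros, the optional iteration-count argument, the in-place forward C2C transform, the choice of which dimension is 4096, and the cudaDeviceSynchronize() before each progress print are all assumptions.

```cuda
// fft_loop.cu: hypothetical reconstruction of the FFT-only stress test.
// Assumed build line: nvcc fft_loop.cu -lcufft -o fft_loop
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Abort with a message if a CUDA runtime call fails (helper macro, not from the original).
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d CUDA error: %s\n", __FILE__,       \
                    __LINE__, cudaGetErrorString(err_));              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Abort with the numeric status if a cuFFT call fails (helper macro, not from the original).
#define CHECK_CUFFT(call)                                             \
    do {                                                              \
        cufftResult res_ = (call);                                    \
        if (res_ != CUFFT_SUCCESS) {                                  \
            fprintf(stderr, "%s:%d cuFFT error: %d\n", __FILE__,      \
                    __LINE__, (int)res_);                             \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(int argc, char **argv)
{
    // First argument: GPU index. Second (assumed, optional): iteration count.
    int device = (argc > 1) ? atoi(argv[1]) : 0;
    long long iterations = (argc > 2) ? atoll(argv[2]) : 100000000LL;

    // Which dimension is 4096 and which is 2048 is an assumption.
    const int nx = 4096, ny = 2048;
    const size_t bytes = sizeof(cufftComplex) * nx * ny;  // 64 MiB

    CHECK_CUDA(cudaSetDevice(device));

    cufftComplex *d_data = NULL;
    CHECK_CUDA(cudaMalloc((void **)&d_data, bytes));

    cufftHandle plan;
    CHECK_CUFFT(cufftPlan2d(&plan, nx, ny, CUFFT_C2C));

    for (long long i = 1; i <= iterations; ++i) {
        // Zero the array, then run the planned 2D FFT in place.
        CHECK_CUDA(cudaMemset(d_data, 0, bytes));
        CHECK_CUFFT(cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD));

        if (i % 1000 == 0) {
            // Wait for queued work so the printed count reflects completed FFTs.
            CHECK_CUDA(cudaDeviceSynchronize());
            printf("%lld\n", i);
            fflush(stdout);
        }
    }

    CHECK_CUFFT(cufftDestroy(plan));
    CHECK_CUDA(cudaFree(d_data));
    return 0;
}
```

Under these assumptions, one copy per card would be started as `./fft_loop 0`, `./fft_loop 1`, and so on. Checking every return code matters here, because a failing GPU may report an error from cudaMemset() or cudaDeviceSynchronize() before the process stops updating entirely.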
I ran this test simultaneously on 4 GTX 770s. After 40 million iterations, the process running on one of the GPUs stopped updating. That GPU was evidently still doing something, because its fan remained at about 50% and the card temperature stayed about 25 °C above idle. An error showed up in /var/log/messages soon after:
NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
repeated every 2 seconds for 30 seconds, then:
NVRM: GPU at 0000:08:00: GPU-a720f2ac-e433-b7d6-7215-ccf7d97e80b2
NVRM: Xid (0000:08:00): 62, ad75(1e88) 00000000 00000000
Three hours later, the os_schedule message repeated 15 times in 9 seconds, and then again 7 minutes after that.
At this point, I was able to ssh to the machine, but the terminal hung as soon as I got a prompt. I have attached the NVIDIA bug report, which I generated as root after a power cycle. I do not run X on this machine, so I am not sure what nvidia-settings says; I can check if that would help. I have also attached the test case I was using.
The OS is openSUSE 11.2, kernel 2.6.31. The mobo is an Asus P9X79-E WS, and all 4 cards are EVGA GTX 770s. I set all PCIe lanes to Gen2 in the BIOS, following recommendations in this forum. The driver I was using for this test was 325.08, but I suspect the results would be identical with more recent drivers. I can test this if it will help to find a solution. The system also hangs when I use only one GPU at a time, and the hang occurs with other mobos, kernels, etc. I can generate nvidia-bug-reports for any of these combinations if it will help.
The other 3 processes continued updating until I powered down the machine. I will try repeating the test after removing the offending card.