I am getting system hangs while running image analysis code on NVIDIA GPUs. The most recent driver that did not cause this problem was 295.20, although I have not tried every driver between that one and the current ones.
As a minimal test case, I wrote a CUDA program that selects a GPU according to a command line option, allocates a 4096x2048 complex array on the device, and plans a 2D FFT using cuFFT. It then runs a loop many times, each iteration setting the array to zero and executing the planned FFT. Every 1000 iterations, the current iteration number is printed to stdout.
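For readers without the attachment, the loop described above looks roughly like the sketch below. This is not the attached program, only an approximation of it. The assumptions are mine: single-precision complex data, an in-place forward transform, the GPU index as the first positional argument, a loop that runs until interrupted, and a device synchronize before each progress line so the printed count reflects completed work.

```cuda
// Rough sketch of the stress test described above (NOT the attached program).
// Assumed, not stated in the post: single precision, in-place forward FFT,
// GPU index as argv[1], loop runs until the process is killed.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Abort with the numeric status if a cuFFT call fails.
#define CHECK_CUFFT(call)                                             \
    do {                                                              \
        cufftResult res_ = (call);                                    \
        if (res_ != CUFFT_SUCCESS) {                                  \
            fprintf(stderr, "cuFFT error: %d (%s:%d)\n",              \
                    (int)res_, __FILE__, __LINE__);                   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(int argc, char **argv)
{
    // Select the GPU from the command line; default to device 0.
    int device = (argc > 1) ? atoi(argv[1]) : 0;
    CHECK_CUDA(cudaSetDevice(device));

    // 4096x2048 complex array on the device (64 MiB in single precision).
    const int nx = 4096, ny = 2048;
    const size_t bytes = sizeof(cufftComplex) * nx * ny;
    cufftComplex *d_data = NULL;
    CHECK_CUDA(cudaMalloc((void **)&d_data, bytes));

    // Plan the 2D complex-to-complex FFT once, outside the loop.
    cufftHandle plan;
    CHECK_CUFFT(cufftPlan2d(&plan, nx, ny, CUFFT_C2C));

    // Zero the array, run the FFT, report progress every 1000 iterations.
    for (unsigned long long i = 1; ; ++i) {
        CHECK_CUDA(cudaMemset(d_data, 0, bytes));
        CHECK_CUFFT(cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD));
        if (i % 1000 == 0) {
            CHECK_CUDA(cudaDeviceSynchronize());
            printf("%llu\n", i);
            fflush(stdout);
        }
    }
}
```

Something like `nvcc fft_loop.cu -o fft_loop -lcufft` should build it (the file name is made up), and `./fft_loop 2` would run it on GPU 2. Checking every return code matters here, since a GPU that has fallen over should make the next call fail rather than let the loop keep counting.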
I ran this test simultaneously on 4 GTX 770s. After 40 million iterations, the process running on one of the GPUs stopped updating. That GPU was evidently still doing something, because its fan stayed at about 50% and the card temperature stayed about 25 °C above idle. An error showed up in /var/log/messages soon after:
NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
repeated every 2 seconds for 30 seconds, then:
NVRM: GPU at 0000:08:00: GPU-a720f2ac-e433-b7d6-7215-ccf7d97e80b2
NVRM: Xid (0000:08:00): 62, ad75(1e88) 00000000 00000000
Three hours later, the os_schedule message repeated 15 times within 9 seconds, and then again 7 minutes after that.
At this point, I was able to ssh to the machine, but the terminal hung as soon as I got a prompt. I have attached the nvidia bug report, which I generated as root after a power cycle. I do not run X on this machine, so I am not sure what nvidia-settings says; I can check if that would help. I have also attached the test case I was using.
The OS is openSUSE 11.2, kernel 2.6.31. The mobo is an Asus P9X79-E WS, and all 4 cards are EVGA GTX 770s. I set all PCIe lanes to Gen2 in the BIOS, following recommendations in this forum. The driver I was using for this test was 325.08, but I suspect the results would be identical with other recent drivers. I can test this if it will help to find a solution. The system also hangs when I use only one GPU at a time, and the hang occurs with other mobos, kernels, etc. I can generate nvidia-bug-reports for any of these combinations if it will help.
The processes on the other 3 GPUs kept updating until I powered down the machine.