Fatal errors with all drivers after 295.40

I am using various GPUs to run CUDA code for scientific image analysis. I recently purchased several GTX 770 cards, hoping to exploit the increased number of shaders and amount of RAM. Since I could not get the new cards to work with the old drivers, I updated to 319 and ran into problems. I am now running 325 and still having problems.

The image analysis is run on one image after another, and takes a few minutes per image. The code has been debugged and memchecked. After about two days of running, the analysis hangs (gets stuck on a data point), and eventually errors show up in messages. The most common error is “NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context.” There are also sometimes errors like “NVRM: GPU at 0000:08:00: GPU-46ee5dba-1060-3ea0-761d-afd328771b21” and “NVRM: Xid (0000:08:00): 62, ad75(1e88) 00000000 00000000”. Once the analysis hangs, I can kill the process manually, but then I am unable to do anything with the GPU until a power cycle.
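In case it helps anyone compare logs, here is a minimal sketch of how the Xid lines could be pulled out of the kernel messages and tallied per GPU. The function names and the regular expression are my own assumptions, based only on the lines quoted above; other driver versions may format these lines differently.

```python
import re
from collections import Counter

# Pattern assumed from the Xid line quoted above, e.g.
# "NVRM: Xid (0000:08:00): 62, ad75(1e88) 00000000 00000000"
XID_RE = re.compile(r"NVRM: Xid \(([^)]+)\): (\d+)")


def parse_xid_lines(lines):
    """Return (pci_address, xid_code) for every line that contains an NVRM Xid entry."""
    events = []
    for line in lines:
        match = XID_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


def count_xids(lines):
    """Count how often each (pci_address, xid_code) pair occurs."""
    return Counter(parse_xid_lines(lines))
```

The input can be any iterable of log lines, for example an open file handle on a saved copy of the kernel messages.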

I will give the specs for the latest configuration to have problems, but it seems to me that the issue is with the drivers, since the same code runs stably (for months!) on multiple hardware / software combinations. With the 3XX drivers, I encounter this problem on kernels 2.6 through 3.8 and with gcc 4.4 through 4.7, even after trying various tricks suggested all over the internet (it seems MANY people have had MANY problems using these drivers!). I am not running X during the analysis, and I am unable to start X after a crash, so the bug report was generated from text mode.

I would be grateful for ANY help with this issue. If the only solution is to revert to 570 cards, which are compatible with the 295 drivers, then just let me know so I don’t have to take a leap of faith. If there is any way I can get better diagnostics, let me know. If anyone has had similar problems and found a solution or has suggestions, please either reply here or PM me. Thanks!

nvidia-bug-report.log.gz (81.7 KB)