System hangs with drivers 319.23, 319.32, 325.08 and others - simple test case included

I am having system hangs while executing some image analysis code on nVidia GPUs. The most recent driver to not cause this problem was 295.20, although I have not tried every single driver between that and the current ones.

As a minimal test case, I wrote a CUDA program that selects a GPU according to a command-line option, allocates a 4096x2048 complex array on the device, plans a 2D FFT using cuFFT, then repeats a loop many times, setting the array to zero and executing the planned FFT. Every 1000 iterations, the current iteration number is printed to stdout.
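
In outline the program looks like this (the attached main.cu has the exact code; this is just a sketch of the structure, so names and the loop bound here are illustrative):

// sketch of the test loop: zero a 4096x2048 complex array, run a planned 2D FFT, repeat
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>

int main(int argc, char **argv)
{
    int dev = (argc > 1) ? atoi(argv[1]) : 0;                  // GPU selected by command-line option
    cudaSetDevice(dev);

    const int nx = 4096, ny = 2048;
    cufftComplex *data;
    cudaMalloc((void **)&data, sizeof(cufftComplex) * nx * ny); // complex array on the device

    cufftHandle plan;
    cufftPlan2d(&plan, nx, ny, CUFFT_C2C);                      // plan the 2D FFT once

    for (long i = 0; ; ++i) {                                   // repeat "many times"
        cudaMemset(data, 0, sizeof(cufftComplex) * nx * ny);    // set the array to zero
        cufftExecC2C(plan, data, data, CUFFT_FORWARD);          // execute the planned FFT
        if (i % 1000 == 0) {
            cudaDeviceSynchronize();
            printf("%ld\n", i);                                 // report progress every 1000 iterations
            fflush(stdout);
        }
    }
}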

I ran this test simultaneously on 4 GTX 770s. After 40 million iterations, the process running on one of the GPUs stopped updating. That GPU continued doing something, because the fan remained at about 50% and the card temperature was elevated over idle by ~25C. An error showed up in /var/log/messages soon after:

NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

repeated every 2 seconds for 30 seconds, then:

NVRM: GPU at 0000:08:00: GPU-a720f2ac-e433-b7d6-7215-ccf7d97e80b2
NVRM: Xid (0000:08:00): 62, ad75(1e88) 00000000 00000000

Three hours later, the os_schedule message repeated 15 times within 9 seconds, then again 7 minutes after that.

At this point, I was able to ssh to the machine, but the terminal was hung as soon as I got a prompt. I have attached the nvidia bug report which I generated as root after a power cycle. I do not run X on this machine, so I am not sure what nvidia-settings says - I can check, if this will help. I have also attached the test case I was using.

The OS is openSUSE 11.2, kernel 2.6.31. The mobo is an Asus P9X79-E WS, and all 4 cards are EVGA GTX 770s. I set all PCI lanes to Gen2 in BIOS, according to recommendations in this forum. The driver I was using for this test was 325.08, but I suspect the results would be identical with recent drivers. I can test this if it will help to find a solution. The system also hangs when I use only one GPU at a time, and it occurs with other mobos, kernels, etc. I can generate all possible combinations of nvidia-bug-reports if it will help.

The other 3 processes continued updating until I powered down the machine.

nvidia-bug-report.log.gz (103 KB)
main.cu (1.16 KB)

I have had hangs with recent drivers too. Rendering sometimes turns into a slideshow while playing 3D, then the system completely freezes and no longer responds to ssh or pings. I noticed that the card clocks the SM up to 1200 MHz at that point, but I think it shouldn't go over 980 MHz in boost mode, as stated in the specification.

I tried to force a medium performance level via "PerfLevelSrc=0x2222; PowerMizerDefaultAC=0x2" to avoid this, at the cost of performance, which was not critical for me. It helped: the system didn't freeze for a few days.
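
For anyone who wants to try the same thing: those options go into the Device section of xorg.conf as a RegistryDwords option. A sketch (the Identifier is just an example from my config):

Section "Device"
    Identifier "GTX670"
    Driver     "nvidia"
    # pin the performance level to the one selected by PowerMizerDefaultAC
    Option     "RegistryDwords" "PerfLevelSrc=0x2222; PowerMizerDefaultAC=0x2"
EndSection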

For now I have switched back to the automatic PowerMizer mode. The GPU clocks at 1050 MHz at 60% load. Is that normal? Is there any way to cap the boost limit at 980 MHz while keeping the adaptive power-saving feature enabled? I haven't tried setting "Coolbits" and the manual overclocking feature yet.

The card is a GeForce GTX 670 with driver 325.08, on Debian kernels 3.9.4 (currently running) and 3.10. The first such freeze happened with driver 319.32 on the 3.9.4 kernel. I hadn't played anything 3D for a couple of months before that, and the system had a stable uptime of two months.

zakaryah, can you provide the actual source code for the test case? I would like to test it here too.

I am having a similar problem, not with OpenCL/CUDA code, but with general desktop use. I have a Gigabyte GV-N760OC-2GD Rev2.0 graphics card on a Gigabyte GA-Z87X-UD3H motherboard. I had been receiving:

NVRM: Xid (0000:01:00): 59, 009e(1cf0) 00000000 00000000

and an eventual graphics lockup prior to upgrading the BIOS to a new revision. That upgrade seems to have resolved the "Xid: 59" error, but now, after several days of uptime, I'm getting:

Aug 9 09:49:56 electra kernel: [258243.231684] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Aug 9 09:49:56 electra kernel: [258243.231840] NVRM: GPU at 0000:01:00: GPU-cdd4285b-04bc-3196-2ed7-66f0f6ba9de8
Aug 9 09:49:56 electra kernel: [258243.231842] NVRM: Xid (0000:01:00): 62, b444(1e9c) 00000000 00000000

followed by the graphics card seizing up (though I can move the mouse, nothing else responds), but I can ssh into the machine from a laptop.

I’m running Mint 15 (which is based on Ubuntu 13.04), with a 3.8.0-27-generic kernel. Also, as time went by, I did occasionally receive “os_schedule: Attempted to yield CPU while in atomic or interrupt context” messages, accompanied by a brief stutter.

So far, I have tried the 310.44, 313.30, and 319.32 drivers; I’m currently trying the 325.15 drivers, hoping for an improvement. I am booting the machine via UEFI, so the Linux kernel is using the efifb driver.

Edit: The 325.15 drivers didn’t fix the problem. If it happens again, I’m going to put my older GTS 450 into this machine and see if that addresses the stability issue.

Edit 2: With 3 and a half days of uptime, no errors after falling back to my GTS 450. Seems like a more general driver bug with the new nVidia GPU?

Edit 3: 5 days of uptime with my GTS 450, and no NVRM errors. It’s either a defective card, or the nVidia drivers have serious issues with the new GPUs.

Edit 4: New GTX 760 received, no errors so far. Looking like a defective card.

Edit 5: New card ended up doing the same thing. Returned. Will get a different card later.

me too.

Where’s your simple testcase?

There are no sorcerers here.

Where’s your sample testcase?

I attached two files - nvidia-bug-report.log.gz and main.cu. The latter should be easy to compile and execute. I have stopped wasting my time with the 770 cards and reverted to 570 and 670s, with drivers that do not exhibit this problem (295.20 and 304.88). I have eight 770 cards gathering dust until someone makes a suggestion for fixing this problem, at which point I will set up another machine for testing.

zakaryah, what version of the CUDA SDK are you using? I am getting the error below. I also downloaded cutil.h into /usr/local/cuda/include/, but still have a compilation issue.

root@test-Precision-WorkStation-T7500:~# nvcc main.cu
main.cu:8:19: fatal error: cutil.h: No such file or directory
compilation terminated.

CUDA SDK v4.1. Looks like you need to add /usr/local/cuda/include to your include path??

zakaryah, still an issue compiling:

root@test-Precision-WorkStation-T7500:~# nvcc main.cu
/tmp/tmpxft_0000118d_00000000-13_main.o: In function `main':
tmpxft_0000118d_00000000-1_main.cudafe1.cpp:(.text+0x107): undefined reference to `cufftPlan2d'
tmpxft_0000118d_00000000-1_main.cudafe1.cpp:(.text+0x22b): undefined reference to `cufftExecC2C'
collect2: ld returned 1 exit status

Can you share your cutil.h file? I think it doesn't come with the SDK.
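
Those undefined references are a linker error rather than a header problem: cufftPlan2d and cufftExecC2C live in libcufft, so the link needs -lcufft. Assuming the toolkit is installed under /usr/local/cuda and cutil.h is somewhere on the include path, a command along these lines should build it (the paths and output name are just examples):

# add the toolkit headers and link against cuFFT (paths assume a default install)
nvcc -I/usr/local/cuda/include main.cu -o fft_loop -lcufft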

Does anyone know if the 36 hour TDR bug (Windows) is related to these problems with the 7XX series linux drivers? That bug is well documented here

https://forums.geforce.com/default/topic/549618/geforce-700-600-series/gtx-780-freezing-and-stuttering-after-about-2-days-on-/1

and was apparently solved. The symptoms seem strikingly similar to me. Several people on the Folding@home forum have complained about this problem in linux as well. If the bugs are related, are there plans to fix this in the linux driver? I am sitting on a pile of expensive 770 cards which I cannot use.

Same here. My 4 GTX 780s report wrong fan speeds after exactly 36 hours; soon after that I lose X and have to do a hard reboot. I have contacted Nvidia, but no real action on their side so far. Still hoping.
D.

Aaron - any comments on this? Thanks.

I have the same problem with my GTX 780.
Been having these problems ever since I bought the card.
It seems related to the 36-hour bug, since my uptime is always around 1 day and 12 hours when the problems start happening.

20:45:27 up 1 day, 12:01,  6 users,  load average: 2.51, 1.25, 1.05

Either my X freezes completely, or I get really bad stuttering.
Whenever these problems start occurring, querying GPU information using nvidia-smi freezes X for a few seconds before it outputs the following:

Fri Oct 25 20:08:08 2013       
+------------------------------------------------------+                       
| NVIDIA-SMI 5.325.15   Driver Version: 325.15         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 780     Off  | 0000:03:00.0     N/A |                  N/A |
|ERR!   61C  N/A     N/A /  N/A |      279MB /  3071MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+

Notice the ERR! in there.
This is how it usually looks:

Fri Oct 25 20:16:10 2013       
+------------------------------------------------------+                       
| NVIDIA-SMI 5.325.15   Driver Version: 325.15         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 780     Off  | 0000:03:00.0     N/A |                  N/A |
| 29%   41C  N/A     N/A /  N/A |      281MB /  3071MB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
+-----------------------------------------------------------------------------+

If I SSH to my computer and kill X I get output in dmesg similar to the following:

[135221.903160] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[135223.903168] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[135223.903270] NVRM: GPU at 0000:03:00: GPU-f7645f27-47b0-1f86-ef7c-0bafedf12f90
[135223.903273] NVRM: Xid (0000:03:00): 62, a2a2(1e7c) 00000000 00000000
[135229.918660] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
[135231.918668] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context

My specs:

OS: Arch Linux, kernel 3.11.6
DRV: 325.15, although I have tested every single release that is compatible with my GTX 780 with the same result
GPU: ASUS GTX 780 3GB
nvidia-bug-report.log.gz (79.5 KB)

I have a GTX 460 and a GTX 650 Ti Boost on two different machines. A few months ago I updated drivers and started getting system freezes on both machines, especially every time the system tried to come out of standby: it would freeze and have to be cold-booted. I tried rolling back to several older versions of the drivers, but for the GTX 460, only 314.07 is stable, and for the GTX 650 Ti Boost, I am now testing 314.22 (every driver after that causes the freezes). I called Nvidia tech support, and they are the ones who suggested that 314.07 would be stable for the GTX 460, admitting there is an issue. I am not sure how newer drivers get the WHQL label, because a huge number of people have reported/blogged about this problem with Nvidia drivers. I am scared to use an Nvidia board for my next computer.

I have a 760 on Ubuntu 12.04 64-bit, which was doing the same thing (the os_schedule error, X using 100% CPU). It appears to be resolved after I disabled application profiles and set PowerMizer to always 'Prefer Maximum Performance'.

I have a GTX 650 on Debian wheezy, and I ran into the same problem after adding memory to my computer and resetting the BIOS. I tried several changes in the BIOS, but without any success.

Setting PowerMizer to 'Prefer Maximum Performance' helps. Unfortunately this makes the fan run louder.
My workaround is to switch to "Prefer Maximum Performance" before suspending; this can be done with the following command:

nvidia-settings --assign="GPUPowerMizerMode=1" && sudo pm-suspend

After resume, PowerMizerMode is set back to adaptive by placing the following script in the /etc/pm/sleep.d/ folder:

#!/bin/sh
# pm-utils hook: restore the PowerMizer mode after resume.
# Replace <username> with the user who owns the X session; nvidia-settings
# also needs access to that user's X display (e.g. DISPLAY) to apply the setting.
case "$1" in
    thaw|resume)
        sudo -u <username> nvidia-settings --assign="GPUPowerMizerMode=2"
        ;;
    *)
        ;;
esac
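
pm-utils only runs the hook if it is executable, so after saving it (the file name here is just an example) you also need something like:

chmod +x /etc/pm/sleep.d/99-nvidia-powermizer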

I hope it helps somebody!