GTX 460 stability issues?

Is anyone else having stability issues using the GTX 460 for CUDA programming? We have two GTX 460s in house, one from ASUS and one from Palit, and neither is stable. The biggest problem we are having is that the kernel (running RHEL 5.4) keeps disabling the IRQ the GPU is on. We’ve tried all the combinations of settings in the driver readme with no luck. An interesting data point is that the behavior changes from boot to boot: sometimes the IRQ gets disabled a few seconds after we start running our software, while on other boot-ups there doesn’t seem to be a problem and we can run for hours. We have tried the following drivers: 256.40, 256.44, and the latest 256.53. We also have a PNY GTX 480 that we have been running with no problems at all: same hardware, same OS and drivers, no issues.
We have ordered a PNY GTX 460 and a PNY GTX 470 to get additional data points and will update this thread once that information is available, but in the meantime I wanted to ask the community whether anyone has noticed similar IRQ-related issues with the 460s.
We are pushing a lot of data into the box at a pretty high rate - 500MB/sec into a 10GigE Myricom card (20,000 packets per second) and then pushing that data to the GPU at a 2000Hz rate. The GPU load is around 80% on the GTX 460, and 50% on the GTX 480.
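For scale, if that 500MB/sec is spread evenly, it works out to roughly 25KB per network packet (500MB / 20,000 packets) and roughly 250KB per transfer to the GPU (500MB / 2000 transfers); those are just back-of-the-envelope numbers, not measurements.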
I suspect it’s a driver issue specific to the 460, since we can simply swap the 460 for the 480 and everything works. Since we have two 460s behaving the same way, I can’t see this being an issue with the individual cards, but we have the PNY 460 coming just to be sure.
This is for a relatively low-power application, so using 470s or 480s is not the preferred solution, especially since we have to run them at an ambient temperature of 40 degrees C.

Any feedback or recommendations would be much appreciated.

I can’t answer your question, but I do have some follow-ups which might help us understand the problem better.

What’s the symptom of the kernel “disabling an IRQ”? How do you test this (by looking at kernel logs)? And when it gets disabled, what happens when you run CUDA apps, what does nvidia-smi report, and what happens to your display?

dmesg reports the failure (see below)

We made sure each device has its own IRQ. The symptom is that my application stops processing data on the GPU. It’s a multi-threaded application, so the rest of it continues to run (input data buffering), but the thread making the CUDA calls stops processing data. It is unclear at the moment whether it is simply stuck in a GPU call, or whether the stream is never reporting that it has completed.
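
To help narrow that down, one thing we may try (just a sketch; waitForStream is a made-up helper name, not something in our code today) is to replace the blocking wait with a timed polling loop around cudaStreamQuery, so the log tells us whether the stream ever reports completion:

#include <cstdio>
#include <ctime>
#include <unistd.h>
#include <cuda_runtime.h>

// Hypothetical helper: poll the stream instead of blocking on it, so we can
// tell whether the queued work eventually completes or the GPU has stopped
// responding (e.g. after its IRQ gets disabled).
static bool waitForStream(cudaStream_t stream, double timeoutSec)
{
    time_t start = time(NULL);
    for (;;) {
        cudaError_t err = cudaStreamQuery(stream);   // non-blocking status check
        if (err == cudaSuccess)
            return true;                             // all queued work has finished
        if (err != cudaErrorNotReady) {
            fprintf(stderr, "stream error: %s\n", cudaGetErrorString(err));
            return false;                            // a real error, not just "still busy"
        }
        if (difftime(time(NULL), start) > timeoutSec) {
            fprintf(stderr, "stream still not done after %.0f seconds\n", timeoutSec);
            return false;                            // GPU looks wedged
        }
        usleep(1000);                                // back off 1 ms between polls
    }
}

If the timeout fires after dmesg shows the IRQ being disabled, that would point at the stream never completing rather than the call itself hanging.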

Tried irqpoll as the error indicated, but that didn’t help.

We did have RmInit failures as others have posted, but we resolved those by tweaking kernel parameters.

We are not running X, and there is no display connected to the box. (using ssh to access the hardware.)

Interestingly enough, the IRQ assigned to USB almost always gets disabled at the same time as the IRQ assigned to the GPU. Not sure why that is, since we don’t really have any USB devices connected. (Technically IPMI uses USB under the hood, but we don’t physically have anything connected to any of the USB ports.)

After the failure, restarting the application lets it work as normal, at least until the IRQ is disabled again.

See below for: dmesg info, nvidia-smi info, and gdb call trace.

irq 106: nobody cared (try booting with the “irqpoll” option)

Call Trace:

[] __report_bad_irq+0x30/0x7d

[] note_interrupt+0x1e6/0x227

[] __do_IRQ+0xbd/0x103

[] do_IRQ+0xe7/0xf5

[] ret_from_intr+0x0/0xa

[] ip_rcv+0x0/0x57c

[] netif_receive_skb+0x3c9/0x3f5

[] :myri10ge:myri10ge_poll+0x788/0xbba

[] do_IRQ+0xec/0xf5

[] net_rx_action+0xac/0x1e0

[] __do_softirq+0x89/0x133

[] call_softirq+0x1c/0x28

[] do_softirq+0x2c/0x85

[] do_IRQ+0xec/0xf5

[] ret_from_intr+0x0/0xa

handlers:

[] (nv_kern_isr+0x0/0x54 [nvidia])

Disabling IRQ #106

irq 50: nobody cared (try booting with the “irqpoll” option)

Call Trace:

[] __report_bad_irq+0x30/0x7d

[] note_interrupt+0x1e6/0x227

[] __do_IRQ+0xbd/0x103

[] __do_softirq+0x89/0x133

[] do_IRQ+0xe7/0xf5

[] mwait_idle+0x0/0x4a

[] ret_from_intr+0x0/0xa

[] mwait_idle+0x36/0x4a

[] cpu_idle+0x95/0xb8

[] start_kernel+0x220/0x225

[] _sinittext+0x22f/0x236

handlers:

[] (usb_hcd_irq+0x0/0x55)

Disabling IRQ #50

nvidia-smi output

[drk@HPP2 ~]$ nvidia-smi -q --gpu=0

GPU 0:

    Product Name            : GeForce GTX 460

    PCI ID                  : e2210de

    Temperature             : 46 C

Additional GDB info on the thread. The thread is pegged at 100% CPU when the IRQ becomes disabled; below is the stack trace:

(gdb) thread 3

[Switching to thread 3 (Thread 0x420e5940 (LWP 4298))]#0 0x0000003a86cba937 in sched_yield () from /lib64/libc.so.6

(gdb) up

#1 0x00002aaaaab8baf8 in ?? () from /usr/lib64/libcuda.so

(gdb) up

#2 0x00002aaaaab8bb6d in ?? () from /usr/lib64/libcuda.so

(gdb) up

#3 0x00002aaaaab8a757 in ?? () from /usr/lib64/libcuda.so

(gdb) up

#4 0x00002aaaaab74fd5 in ?? () from /usr/lib64/libcuda.so

(gdb) up

#5 0x00002aaaaab8328a in ?? () from /usr/lib64/libcuda.so

(gdb) up

#6 0x00002aaaaab6088e in ?? () from /usr/lib64/libcuda.so

(gdb) up

#7 0x00002aaaaac00147 in ?? () from /usr/lib64/libcuda.so

(gdb) up

#8 0x00002b7b622cac8d in ?? () from /usr/local/cuda/lib64/libcudart.so.3

(gdb) up

#9 0x00002b7b622cc3b3 in ?? () from /usr/local/cuda/lib64/libcudart.so.3

(gdb) up

#10 0x00002b7b622bdba8 in cudaLaunch () from /usr/local/cuda/lib64/libcudart.so.3

(gdb) up

#11 0x00002b7b60420a76 in ?? () from /usr/local/cuda/lib64/libcufft.so.3

(gdb) up

#12 0x00002b7b6042191a in ?? () from /usr/local/cuda/lib64/libcufft.so.3

(gdb) up

#13 0x00002b7b603d5a39 in ?? () from /usr/local/cuda/lib64/libcufft.so.3

(gdb) up

#14 0x00002b7b603d444a in ?? () from /usr/local/cuda/lib64/libcufft.so.3

(gdb) up

#15 0x00002b7b603c30d7 in ?? () from /usr/local/cuda/lib64/libcufft.so.3

(gdb) up

#16 0x000000000040c819 in CPulseCompression::processPulse (this=0x1896fa30, inputPulse=, outputPulse=0x420e50d0,

processPulse=<value optimized out>) at pulseCompression.cpp:655

655 (cufftComplex*)(currentBuf->d_complex1), CUFFT_INVERSE);
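
One more observation on the trace above: the sched_yield frames at the top suggest the runtime is yield-polling while it waits for the GPU, which would explain the core pegged at 100%. It wouldn’t fix the IRQ problem, but as a diagnostic we could ask the 3.x runtime for a blocking wait instead of a spin, so the hung thread sleeps rather than burning a core. A minimal sketch (untested on our side; the call has to come before any other CUDA runtime call in the process):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Must run before the first CUDA runtime call creates a context,
    // otherwise it should fail with cudaErrorSetOnActiveProcess.
    // With blocking sync, a waiting thread sleeps instead of spinning
    // on sched_yield; the underlying IRQ problem is unchanged.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceBlockingSync);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaSetDeviceFlags failed: %s\n", cudaGetErrorString(err));

    // ... rest of the application (threads, cuFFT plans, kernels) ...
    return 0;
}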

So we got the PNY GTX 460 (VCGGTX4601XPB-OC) in, simply replaced the Palit card with it, and all of our issues have been resolved. The IRQ problems have gone away. In addition, a performance problem we were seeing with both the ASUS and the Palit cards has gone away. I have another post discussing how the performance of those cards would increase if we connected a monitor, or even just a DVI-VGA dongle, to the DVI ports on the cards. The performance of this card is in line with expectations based on its overclock, and is constant.
So at this point it seems I can’t recommend using either the ASUS or the Palit card with current drivers for any CUDA work.

We are having a similar issue. The IRQ gets dropped at random times after boot with the following messages, causing all CUDA-related applications to hang or crash.
The card we are using is a GeForce GTX 470.

[ 9328.271495] NVRM: GPU at 0000:01:00.0 has fallen off the bus.
[ 9329.445297] irq 16: nobody cared (try booting with the “irqpoll” option)
[ 9329.445303] Pid: 0, comm: kworker/0:1 Tainted: P 2.6.39-gentoo-r3 #8
[ 9329.445310] Call Trace:
[ 9329.445311] [] __report_bad_irq+0x40/0xa9
[ 9329.445318] [] note_interrupt+0x14b/0x1b4
[ 9329.445321] [] handle_irq_event_percpu+0x178/0x190
[ 9329.445324] [] handle_irq_event+0x2c/0x48
[ 9329.445326] [] handle_fasteoi_irq+0x78/0x98
[ 9329.445329] [] handle_irq+0x83/0x8c
[ 9329.445331] [] do_IRQ+0x48/0xaf
[ 9329.445335] [] common_interrupt+0x13/0x13
[ 9329.445336] [] ? mwait_idle+0x9f/0xc6
[ 9329.445341] [] ? mwait_idle+0x4c/0xc6
[ 9329.445343] [] cpu_idle+0x5a/0x91
[ 9329.445345] [] start_secondary+0x180/0x184
[ 9329.445347] handlers:
[ 9329.445348] [] (usb_hcd_irq+0x0/0x5b)
[ 9329.445351] [] (nv_kern_isr+0x0/0x58 [nvidia])
[ 9329.445485] Disabling IRQ #16