Issues with Titan X

Can anyone confirm that the Titan X works with CUDA 7.0.28 and kernel driver 346.59 on Linux? I just got a Titan X and I think it might be defective. It shows ERR in the fan speed column the first time I run nvidia-smi, but subsequent calls report normal fan speeds. When I start a CUDA program, the card hangs. I have a 980 that works perfectly in this machine with the same software, so I think my software is set up correctly.

According to the changelogs, Titan X support was just added for this kernel driver version. Any known issues that I might be hitting, or should I probably RMA the card? I already tried swapping around PCIe slots, and I’m certain my power supply is up to the task of driving the Titan (1200W).

You should try 349.16, which is the latest driver and the first that officially supports the Titan X.
https://devtalk.nvidia.com/default/topic/825597

Just tried it. The 349.16 release seems better, but it does not totally fix things for me. The ERR messages in nvidia-smi seem to be gone, and some quick CUDA programs even run. But once I really load things up, I get illegal instruction messages and crashes. Things like this show up in dmesg:

NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0): Out Of Range Register
[ 6790.023961] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0): Physical Multiple Warp Errors
[ 6790.023965] NVRM: Xid (PCI:0000:01:00): 13, Graphics Exception: ESR 0x504648=0xd 0x504650=0x4 0x504644=0xd3eff2 0x50464c=0x7f
[ 6790.023983] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Register
[ 6790.023987] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 1, TPC 0): Physical Multiple Warp Errors

I’m only loading about 4 GB of data on the card, so it’s not actually out of memory or anything. This is CUDA code that works fine on my 980 in the same machine. (I pulled out the 980 to rule out any conflicts.)
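
If it helps anyone comparing logs, here is a minimal sketch for tallying Xid events like the ones above by Xid number and GPC/TPC. The function names (parse_xid_lines, summarize) and the regular expressions are my own assumptions, based only on the line format shown in this post; other driver versions may format these messages differently. It only counts events and cannot tell a driver bug from bad hardware.

```python
import re
from collections import Counter

# Assumed line format, taken from the dmesg excerpt in this thread:
#   NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 0, TPC 0): ...
XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<msg>.*)"
)
UNIT_RE = re.compile(r"\(GPC (?P<gpc>\d+), TPC (?P<tpc>\d+)\)")


def parse_xid_lines(text):
    """Return one dict per NVRM Xid line found in the given dmesg text."""
    events = []
    for line in text.splitlines():
        match = XID_RE.search(line)
        if match is None:
            continue  # not an Xid line
        msg = match.group("msg").strip()
        unit = UNIT_RE.search(msg)  # some Xid lines name no GPC/TPC
        events.append({
            "pci": match.group("pci"),
            "xid": int(match.group("xid")),
            "gpc": int(unit.group("gpc")) if unit else None,
            "tpc": int(unit.group("tpc")) if unit else None,
            "message": msg,
        })
    return events


def summarize(events):
    """Count events per (xid, gpc, tpc) triple."""
    return Counter((e["xid"], e["gpc"], e["tpc"]) for e in events)


# Two lines copied from the log above, used as sample input.
SAMPLE = """\
[ 6790.023961] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Global Exception on (GPC 0, TPC 0): Physical Multiple Warp Errors
[ 6790.023983] NVRM: Xid (PCI:0000:01:00): 13, Graphics SM Warp Exception on (GPC 1, TPC 0): Out Of Range Register
"""

if __name__ == "__main__":
    for (xid, gpc, tpc), count in summarize(parse_xid_lines(SAMPLE)).most_common():
        print(f"Xid {xid} GPC {gpc} TPC {tpc}: {count} event(s)")
```

To use it on a real log, save the dmesg output to a file and pass its contents to parse_xid_lines instead of SAMPLE.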

Maybe the driver still has some problems. Follow the steps here: https://devtalk.nvidia.com/default/topic/522835/linux/if-you-have-a-problem-please-read-this-first/
to create a bug report with all the relevant information, and attach it to your post so this can get fixed.

OK, I tried various CUDA programs; they worked for a while and I was unable to reproduce the error. Then I got a pretty hard lock-up, and now I can’t even get X to start. It tries to start and the KDE splash screen appears, but then the screen fills with green speckled corruption and hangs. I was still able to log into the machine remotely at that point, so I did and ran the bug report script.

nvidia-bug-report.log.gz (95.4 KB)

Still waiting on a response here. Can anyone just confirm that this card works for them for CUDA processing on Linux? At this point I am fairly certain that my card is defective. I just want to rule out any known problems with the drivers before I start the RMA process.

In case you don’t get a response, you might want to consider installing a 30-day trial of Windows on an external hard drive to see if the problems still occur. I know this is a hassle, but it might be worth it before you deal with an RMA.