failing under heavy double precision load

I wish to make sure Nvidia is aware of this problem, in the hope that it will be fixed in the Kepler GTX680.

The problem is being discussed on the TeraChem forum, but it is not limited to TeraChem. It always happens on GF110, and there are two reports on GF100. There would seem to be grounds for concern that it will persist into Kepler.

Nvidia would be able to reproduce this problem using TeraChem 1.45. I believe they have a copy of TeraChem for testing purposes.

http://petachem.com/forum/index.php?topic=95.msg205#msg205

http://petachem.com/forum/index.php?topic=95.msg211#msg211

What isolates this as being a hardware problem?

Applying Occam’s razor, which is more likely: a 200,000 line CUDA program has a thread race bug, or all GeForce DP hardware is broken?

  1. It works fine on Tesla.
  2. It's not the only program that strikes this fault.
  3. Only programs that subject gaming cards to heavy double precision load will encounter this fault. There are not many of these.

Proving this fault will require interaction between the developers and Nvidia.

I suspect “the card fails” means something more dramatic than the symptoms of a program error. More info as it becomes available.

The cards hang: they cease to operate. This effect is robustly reproducible. Presumably a reboot clears it.

I guess it’s in NVIDIA’s interest to just leave that problem as it is. :(

I fear they believe it is.

Tesla cards are just too expensive for a PhD student’s research account. Halve the price and they would see a rewarding market response.

Surely the developers of these codes are registered developers and/or have contacts at NVIDIA. Bugs reported via the registered developer bug submission tool are taken very seriously by NVIDIA. So are ones reported on the forums, but without a minimal repro case or a specific set of instructions (i.e., execute these shell commands), no one can confirm or deny that a problem exists.

If the problem is truly an overtaxing of the double precision units as has been hypothesized, then shouldn’t a kernel that does nothing but double precision a*b+c a million times per thread trigger it?
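For anyone who wants to try that, here is a minimal sketch of such a kernel. It is not taken from TeraChem or from any confirmed repro case; the kernel name `dp_fma_stress`, the launch dimensions, and the constants are all arbitrary choices of mine. Each thread runs a dependent chain of one million double precision fused multiply-adds and stores its result so the compiler cannot discard the loop. Whether this is enough load to trigger the reported hang is an open question.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Report a CUDA error and bail out of main (hypothetical helper macro).
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s failed: %s\n", #call,                  \
                    cudaGetErrorString(err_));                         \
            return EXIT_FAILURE;                                       \
        }                                                              \
    } while (0)

// Each thread repeats c = a*b + c in double precision `iters` times.
// a and b are kernel arguments, so they are not compile-time constants.
__global__ void dp_fma_stress(double *out, double a, double b, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double c = static_cast<double>(tid);   // per-thread starting value
    for (int i = 0; i < iters; ++i) {
        c = fma(a, b, c);                  // one double precision FMA
    }
    out[tid] = c;                          // keep the result live
}

int main()
{
    // Assumed sizes: large enough to fill the chip, nothing more principled.
    const int blocks  = 256;
    const int threads = 256;
    const int iters   = 1000000;
    const size_t n = static_cast<size_t>(blocks) * threads;

    double *d_out = NULL;
    CHECK(cudaMalloc(&d_out, n * sizeof(double)));

    dp_fma_stress<<<blocks, threads>>>(d_out, 1.0000001, 0.5, iters);
    CHECK(cudaGetLastError());        // catches launch configuration errors
    CHECK(cudaDeviceSynchronize());   // blocks until the kernel finishes

    double first = 0.0;
    CHECK(cudaMemcpy(&first, d_out, sizeof(double), cudaMemcpyDeviceToHost));
    printf("thread 0 result: %.17g\n", first);

    CHECK(cudaFree(d_out));
    return EXIT_SUCCESS;
}
```

Build with something like `nvcc -arch=sm_20 dp_fma_stress.cu` on a Fermi-era toolkit. If the card hangs as described, I would expect `cudaDeviceSynchronize()` either never to return or to return an error. To load the card for longer, launch the kernel repeatedly in a loop rather than raising `iters`, since a single very long kernel on a display card can be killed by the driver watchdog, which would look like a failure but is not the fault being discussed.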

The developer’s position is that the chips are already out there and so there is nothing that can be done. My interest is in preventing this fault from being perpetuated into the Kepler GTX680, which puts me in an awkward position. Would Nvidia listen if someone who doesn’t have access to the source code were to report first-hand experience of the fault through this forum? I am told the fault is robustly reproducible. Nvidia have the program and could test it themselves.

Sounds like a reasonable first test. If I had a GTX580 I would try this.

The developer has sent Nvidia a small binary that robustly hangs 5xx GeForce cards.

This thread appears to be related:

I am not sure they care much about bugs. I had reported a couple of bugs… They just kept saying the next release would fix them, etc. Nothing happened… And after some time, I could not even find my bug reports in their portal… Sigh…