Hardware or software problem?

I am having issues with SDK samples failing, simple kernels not returning correct results, and formerly working code no longer working. Some code, however, still works as intended. I am not noticing any other GPU-related problems; the failures only appear when calling CUDA code. I have previously posted about this issue here.

Specs are a Palit GeForce GTX 560 Ti 2GB, 24 GB of RAM, and a quad-core Xeon E5504 @ 2GHz. I have since done a fresh install of Ubuntu 10.10 and CUDA 4.0. This solved the problem temporarily, but it has returned.

I have produced a simple example replicating the results (duplicate.cu). Compile with nvcc -o duplicate duplicate.cu. This example allocates memory, runs a kernel that sets all the values in memory to a constant, copies the result back, and verifies. For arrays of 2562562 floats or fewer, there are no errors, but for larger arrays approximately 0.01% of the values are incorrect. Neither the positions of the incorrect values nor their count is repeatable from run to run. Standard copying to the same memory works fine.
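For readers without the attachment, here is a minimal sketch of the kind of test described above. It is not the attached duplicate.cu; the names fill, count_mismatches, and check, the default array size, the fill value, and the number of runs are all assumptions made for illustration. It clears the device buffer, fills it with a constant from a kernel, copies it back, and counts the elements that differ, repeating a few times so that run-to-run variation shows up.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Grid-stride kernel: set every element of data[0..n) to value.
__global__ void fill(float *data, size_t n, float value) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] = value;
}

// Host-side check: count elements that differ from the expected constant.
size_t count_mismatches(const std::vector<float> &host, float expected) {
    size_t bad = 0;
    for (size_t i = 0; i < host.size(); ++i)
        if (host[i] != expected) ++bad;
    return bad;
}

// Abort with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(1);
    }
}

int main(int argc, char **argv) {
    // Element count from the command line; the default is an arbitrary choice.
    size_t n = (argc > 1) ? (size_t)strtoul(argv[1], NULL, 10) : ((size_t)1 << 24);
    if (n == 0) { fprintf(stderr, "element count must be positive\n"); return 1; }
    const float value = 42.0f;   // arbitrary constant
    const int runs = 5;          // arbitrary repeat count

    float *dev = NULL;
    check(cudaMalloc((void **)&dev, n * sizeof(float)), "cudaMalloc");
    std::vector<float> host(n);

    const int threads = 256;
    size_t wanted = (n + threads - 1) / threads;
    int blocks = (int)(wanted < 1024 ? wanted : 1024);  // stay well under the grid size limit

    for (int run = 0; run < runs; ++run) {
        // Clear first so stale values from the previous run cannot mask a failure.
        check(cudaMemset(dev, 0, n * sizeof(float)), "cudaMemset");
        fill<<<blocks, threads>>>(dev, n, value);
        check(cudaGetLastError(), "kernel launch");
        check(cudaDeviceSynchronize(), "kernel execution");
        check(cudaMemcpy(&host[0], dev, n * sizeof(float), cudaMemcpyDeviceToHost), "cudaMemcpy");
        size_t bad = count_mismatches(host, value);
        printf("run %d: %lu of %lu values incorrect\n",
               run, (unsigned long)bad, (unsigned long)n);
    }

    check(cudaFree(dev), "cudaFree");
    return 0;
}
```

Every runtime call is checked, so a run that reports no CUDA error but still counts mismatches means the kernel silently wrote or read back wrong data rather than failing outright. Passing different element counts on the command line makes it easy to look for the size at which errors start.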

This card worked when I first installed it in March, and these problems cropped up when I upgraded from Ubuntu 10.10 and CUDA 4.0 RC1 to Ubuntu 11.04 and CUDA 4.0. Does anyone have any ideas? How can I tell if this is a hardware or software problem? I don’t have another machine that I can put the card in, but I have ordered another card. Does anyone know when the next driver release will be?

Thanks.
duplicate.cu (1.56 KB)

Does this happen immediately after a complete power cycle (not just a soft reset) or does it take a while? I’ve noticed that a complete power cycle will fix the problem for me for a while.

Not really. It does seem to start out slower (fewer incorrect values), but it reaches that 0.01% threshold after the second or third run.

The old board has been replaced with an EVGA 2GB GTX 560 Ti, which works properly under stress. This seems to be a hardware issue. Strange that it only manifests itself with CUDA. Now I am trying to get the original board replaced under warranty.