Unstable/Unreliable GPU Device (Tesla C1060)

I have been working on parallelizing an algorithm across as many GPU devices as are available. At the moment, we have 2 Tesla C1060 devices hooked up. The machine has been in the company for a year or two, mostly sitting there powered on and idle.

Just recently, I have been banging my head against the wall because I thought I was a bad programmer. I thought that maybe I was stupid for making a program that would work sometimes, but not all the time. The program creates 2 CPU threads to manage the 2 GPU devices, and the 1st device always worked fine. I began to notice that sometimes the 2nd device wouldn’t give me the correct output, or any output at all. When I put a CUDA error check after the kernel invocation, I got an error about a kernel launch failure.
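For reference, the kind of error check described above could look like the following minimal sketch. The kernel `fillKernel`, the `check` helper and the `CHECK` macro are hypothetical stand-ins (the real kernel is not shown in this thread); only the runtime API calls are real. Note that `cudaThreadSynchronize()` is the CUDA 3.0 name; newer toolkits call it `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: writes each thread's global index into out[].
__global__ void fillKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Hypothetical helper: report a failed runtime call with its location.
static bool check(cudaError_t err, const char *what, const char *file, int line)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s:%d: %s failed: %s\n",
                file, line, what, cudaGetErrorString(err));
        return false;
    }
    return true;
}
#define CHECK(call) check((call), #call, __FILE__, __LINE__)

int main()
{
    const int n = 1024;
    int *d_out = NULL;
    if (!CHECK(cudaMalloc((void **)&d_out, n * sizeof(int)))) return 1;

    fillKernel<<<(n + 255) / 256, 256>>>(d_out, n);

    // Errors in the launch itself (bad configuration, unusable device)
    // are reported by cudaGetLastError() right after the launch.
    bool ok = CHECK(cudaGetLastError());

    // Errors during kernel execution only surface after synchronizing.
    ok = ok && CHECK(cudaThreadSynchronize());

    CHECK(cudaFree(d_out));
    printf("%s\n", ok ? "kernel OK" : "kernel FAILED");
    return ok ? 0 : 1;
}
```

Checking both the launch and the synchronize separately helps tell a launch that never started apart from a kernel that started and then failed.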

I ran the simpleMultiGPU SDK example, which I know had been working just days prior, and it hung. So, something tells me that the 2nd GPU device is problematic. Then I rebooted the machine and it worked! What do you think might be causing this type of malfunction? Just your garden variety memory leaks? Any speculation?
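One way to narrow down whether a particular device is at fault is to run a trivial kernel on each device in turn and verify the result on the host. The following is only a sketch under assumptions: `smokeKernel` and `testDevice` are made-up names, and a passing run does not rule out an intermittent hardware or power problem. Under CUDA 3.0 a host thread has to call `cudaThreadExit()` before it can select a different device, so the sketch does that after each test.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only as a smoke test: writes each index into out[].
__global__ void smokeKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Runs the smoke test on one device. Returns true only if every runtime
// call succeeded and the data read back matches what the kernel wrote.
static bool testDevice(int dev)
{
    const int n = 4096;
    static int h_out[n];
    int *d_out = NULL;

    cudaError_t err = cudaSetDevice(dev);
    if (err == cudaSuccess)
        err = cudaMalloc((void **)&d_out, n * sizeof(int));
    if (err == cudaSuccess) {
        smokeKernel<<<(n + 255) / 256, 256>>>(d_out, n);
        err = cudaGetLastError();  // catches launch failures
    }
    if (err == cudaSuccess)        // blocking copy also reports execution errors
        err = cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    bool ok = (err == cudaSuccess);
    for (int i = 0; ok && i < n; ++i)
        ok = (h_out[i] == i);

    if (err != cudaSuccess)
        fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));

    cudaFree(d_out);   // freeing a NULL pointer is harmless
    cudaThreadExit();  // release this thread's context so the next device can be selected
    return ok;
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    int failures = 0;
    for (int dev = 0; dev < count; ++dev) {
        bool ok = testDevice(dev);
        printf("device %d: %s\n", dev, ok ? "PASS" : "FAIL");
        if (!ok) ++failures;
    }
    return failures;
}
```

Running this in a loop for a while, before and after a reboot, would show whether failures follow one device or appear on both.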

Also, almost without fail, whenever I switch from one data set of .txt files to another (I bring in anywhere from under 1 MB to 400 MB worth of text files), the first run of the program goes wrong. It invokes the kernel, each GPU call returns in 0.000000 seconds, and the rest of the program finishes normally on the host. The output is a bunch of 0s, meaning nothing was written except the default, non-updated values.

When I run the program a second time, it works perfectly, and it continues to work perfectly from then on (until I change data sets again). Any ideas on this?

EDIT: I am running Windows Server 2003 SP2 on a quad-core machine with 4 GB of memory. I am using CUDA 3.0 and Visual Studio 2008 with the runtime API, not the driver API.

These are my 2 problems. Thanks for any ideas.

Daniel

Power supply would be my first guess.

Thanks for the reply, tmurray.

I will have somebody check on that. Any ideas on why the first time I run the program it has a kernel launch failure, but every time after that it works fine? By the way, this is in Debug mode. I still haven’t got Release mode to work. There, I’m getting garbage back from the GPU, even though I checked all the input data and it is arriving just fine.

Daniel

Any thoughts on what would cause the program to run incorrectly the first time on a given data set, while each subsequent execution works perfectly? I’m having this problem in both Debug and Release mode…

If I kill the program midway through its first run on a data set, the next run still fails. The second run only works if the first run is allowed to go all the way through: the kernel launches fail, the GPU calls return early, and the program finishes. I think this problem only started happening after I spread my program across multiple GPUs, much like the SDK example simpleMultiGPU.