Correct on Device 0, Incorrect on others

Howdy,

I have code similar to that outlined below. When I select device 0 (via cudaSetDevice()), everything works fine. When I select any other device, the code “runs to completion” but gives random, incorrect answers. Other codes I have written continue to run just fine on the other devices.

Pseudocode:

[codebox]

for (i = 0; i < numunknowns; ++i)
{
    // a kernel that just zeros out the array Atmp
    zerovec<<<(length/THREAD_CNT)+1, THREAD_CNT>>>(Atmp, length);
    cudaThreadSynchronize();

    // a busy function that fills Atmp with some data; nothing too weird going on here though:
    // no use of shared mem, no divergent branches (according to cudaprof)
    fillAtmp<<<dimGrid, THREAD_CNT>>>(Atmp, extra data);
    cudaThreadSynchronize();

    // a simple kernel that sums up columns of Atmp and inserts those results along the columns of matrix A
    fillA<<<dimGrid2, THREAD_CNT>>>(A, Atmp, extra data);
    cudaThreadSynchronize();
}

[/codebox]

To me, the loop (and the kernels) seem very straightforward. I don’t understand what I’m doing that would cause some dependence on device number.
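One check that may help separate "kernel ran but computed garbage" from "kernel never ran on that device" is to test the return code of every runtime call and every launch. Below is a minimal sketch, not my real code: the `CUDA_CHECK` macro, the body of `zerovec`, the `THREAD_CNT` value, and the array length are all assumptions made for illustration. It uses `cudaThreadSynchronize()` to match the loop above; newer toolkits name the same operation `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREAD_CNT 256  // assumed block size, not taken from the original code

// Hypothetical helper: print the error string and abort if a call failed.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Assumed body for zerovec. The grid is rounded up, so some threads fall
// past the end of the array and must be masked off.
__global__ void zerovec(float *v, int length)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < length) v[idx] = 0.0f;
}

int main(int argc, char **argv)
{
    int dev = (argc > 1) ? atoi(argv[1]) : 0;  // device number from the command line
    const int length = 1000;                   // illustrative size
    const size_t bytes = length * sizeof(float);

    CUDA_CHECK(cudaSetDevice(dev));

    float *Atmp = NULL;
    CUDA_CHECK(cudaMalloc((void **)&Atmp, bytes));

    zerovec<<<(length / THREAD_CNT) + 1, THREAD_CNT>>>(Atmp, length);
    CUDA_CHECK(cudaGetLastError());       // launch or configuration errors
    CUDA_CHECK(cudaThreadSynchronize());  // errors raised while the kernel ran

    // Copy the result back and confirm every element really is zero.
    float *host = (float *)malloc(bytes);
    CUDA_CHECK(cudaMemcpy(host, Atmp, bytes, cudaMemcpyDeviceToHost));
    int bad = 0;
    for (int i = 0; i < length; ++i)
        if (host[i] != 0.0f) ++bad;

    printf("device %d: %d nonzero elements\n", dev, bad);

    free(host);
    CUDA_CHECK(cudaFree(Atmp));
    return bad ? EXIT_FAILURE : EXIT_SUCCESS;
}
```

Running this once per device number (passed as the first argument) should show whether a launch is failing silently on some devices, in which case the "random" answers would just be whatever was left in uninitialized memory.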

Things I have tried (not a complete list):

  1. I have verified this behavior on two different machines with different cards/hardware. Here are the specs:

Machine 1:

Linux x64, RHEL5, CUDA 2.3b

Device 0 & 1: GTX 295

Device 2: Tesla C1060

Device 3: Tesla C1060

Machine 2:

Linux x64, RHEL5, CUDA 2.3b

Device 0: GTX 285

  2. I have experimented with changing the compute mode via nvidia-smi, since we usually leave our cards in Exclusive mode.

  3. I have also tried CUDA 2.2 and 2.3b.

  4. Checked the temperature of all the cards. They seem to hang around 70-75C.

  5. I know there were a number of other things but now I’m drawing a blank.
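For anyone trying to reproduce this, it can also help to confirm what the runtime itself reports for each device number, since the numbering and compute mode seen by the program may differ from what one expects. The following is a rough sketch under stated assumptions: it uses `cudaDeviceGetAttribute` with `cudaDevAttrComputeMode`, which is a newer call than the CUDA 2.3 toolkit used above (older toolkits expose the same value as the `computeMode` field of `cudaDeviceProp`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }

        // Compute mode as an integer; 0 corresponds to cudaComputeModeDefault.
        int mode = -1;
        err = cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }

        printf("device %d: %s, compute capability %d.%d, compute mode %d\n",
               dev, prop.name, prop.major, prop.minor, mode);
    }
    return 0;
}
```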

Blast! I have been struck by the curse of “figure out my problem as soon as I get done posting my question”. I rebooted my multicard machine and everything is working fine now. I would still be curious though if anyone has any thoughts on why this happened in the first place. Thanks.