Howdy,
I have code similar to the outline below. When I specify device 0 (via cudaSetDevice()), everything works fine. When I specify any other device, the code "runs to completion" but gives random incorrect answers. Other codes I have written continue to run just fine on those other devices.
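For reference, here is roughly how I select and verify the device (the index 2 is just an example, and the printf is only a sanity check):

[codebox]
int dev = 2; /* e.g. one of the C1060s */
cudaSetDevice(dev);

/* sanity check: confirm the runtime actually selected the device I asked for */
int active;
cudaGetDevice(&active);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, active);
printf("active device %d: %s\n", active, prop.name);
[/codebox]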
Pseudocode:
[codebox]
for (i = 0; i < numunknowns; ++i)
{
    zerovec<<<(length / THREAD_CNT) + 1, THREAD_CNT>>>(Atmp, length); // a kernel that just zeros out the array Atmp
    cudaThreadSynchronize();
    fillAtmp<<<dimGrid, THREAD_CNT>>>(Atmp, extra data); // a busy kernel that fills Atmp with some data; nothing too weird going on here, no use of shared mem, no divergent branches (according to cudaprof)
    cudaThreadSynchronize();
    fillA<<<dimGrid2, THREAD_CNT>>>(A, Atmp, extra data); // a simple kernel that sums up columns of Atmp and inserts those results along the columns of matrix A
    cudaThreadSynchronize();
}
[/codebox]
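One thing I notice writing this up: none of those calls are error-checked. Here is a sketch of the same loop with checks after each launch (the CUDA_CHECK macro is my own helper, not something from the toolkit), in case a launch is failing silently on the non-zero devices:

[codebox]
#define CUDA_CHECK(call) do { \
    cudaError_t e = (call); \
    if (e != cudaSuccess) { \
        fprintf(stderr, "CUDA error %s:%d: %s\n", \
                __FILE__, __LINE__, cudaGetErrorString(e)); \
        exit(1); \
    } \
} while (0)

for (i = 0; i < numunknowns; ++i)
{
    zerovec<<<(length / THREAD_CNT) + 1, THREAD_CNT>>>(Atmp, length);
    CUDA_CHECK(cudaGetLastError());      /* catch launch failures */
    CUDA_CHECK(cudaThreadSynchronize()); /* catch async execution errors */

    fillAtmp<<<dimGrid, THREAD_CNT>>>(Atmp, extra data);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaThreadSynchronize());

    fillA<<<dimGrid2, THREAD_CNT>>>(A, Atmp, extra data);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaThreadSynchronize());
}
[/codebox]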
To me, the loop (and the kernels) seem very straightforward. I don't understand what I'm doing that would cause any dependence on the device number.
Things I have tried (not a complete list):
- I have verified this behavior on two different machines with different cards/hardware. Here are the specs:
Machine 1:
Linux x64, RHEL5, CUDA 2.3b
Device 0 & 1: GTX 295
Device 2: Tesla C1060
Device 3: Tesla C1060
Machine 2:
Linux x64, RHEL5, CUDA 2.3b
Device 0: GTX 285
- I have experimented with changing the compute mode via nvidia-smi, since we usually leave our cards in Exclusive mode.
- I have also tried CUDA 2.2 and 2.3b.
- I checked the temperature of all the cards; they seem to hang around 70-75 C.
- I know there were a number of other things, but now I'm drawing a blank.