Hi! I have a really stubborn problem that I have been trying to fix for weeks. I have finally managed to narrow it down, and I think it is related to the number of blocks, but the fight continues!
In the host code there is a loop, and the kernels are executed inside it (maybe a bad idea to begin with?). There are also some cudaMemcpy() calls inside. When NUM_BLOCKS == 1 the code works fine. When I change NUM_BLOCKS to anything higher, say 2, it stalls inside Kernel2 or directly after it.
I think it has to do with synchronization, but I have no idea what else to do other than calling cudaThreadSynchronize() after the kernels and __syncthreads() inside them wherever possible. Maybe it has to do with something completely different.
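One thing I started wondering about: can a __syncthreads() that sits inside a divergent branch cause this kind of hang? A toy example of what I mean (a made-up kernel, not my real code):

// Hypothetical kernel: __syncthreads() inside a divergent branch is
// undefined behavior and can hang the block, because the threads
// with tid >= 16 never reach the barrier.
__global__ void divergentSyncExample(float *data)
{
    int tid = threadIdx.x;

    if (tid < 16) {
        data[tid] *= 2.0f;
        __syncthreads();   // only half of the 32 threads arrive here
    }
}

If that is a known failure mode, I'll go and audit my kernels for it.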
Please give me your thoughts about it. Anything is appreciated!
I use a GTX 260 and Visual Studio 2008 Pro with 64-bit Vista.
// Host code
...
for (int k = 0; k < 10; k++) {
    cutilSafeCall( cudaMemcpy(...) );
    cudaThreadSynchronize();
    cutilCheckMsg("Execution failed\n");

    Kernel1<<<NUM_BLOCKS, 32>>>(...);
    cudaThreadSynchronize();
    cutilCheckMsg("Execution failed\n");

    cutilSafeCall( cudaMemcpy(...) );
    cudaThreadSynchronize();
    cutilCheckMsg("Execution failed\n");

    Kernel2<<<NUM_BLOCKS, 32>>>(...);
    cudaThreadSynchronize();
    cutilCheckMsg("Execution failed\n");

    cutilSafeCall( cudaMemcpy(...) );
    cudaThreadSynchronize();
    cutilCheckMsg("Execution failed\n");
}
...
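For what it's worth, the only toy case I could construct that shows the same symptom (fine with one block, hangs with more) is a kernel where one block spin-waits on a flag that another block is supposed to set; as far as I understand, CUDA gives no guarantee about inter-block scheduling order. All names below are made up for illustration, not from my actual code:

// Hypothetical kernel: with <<<1, 32>>> the waiting branch is never
// taken, so it "works"; with <<<2, 32>>> block 1 may spin forever,
// because nothing guarantees that block 0 runs first (or at all)
// while block 1 holds the hardware.
// (flag must be zero-initialized on the host before the launch.)
__global__ void interBlockWaitExample(volatile int *flag)
{
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *flag = 1;                 // block 0 signals

    if (blockIdx.x == 1 && threadIdx.x == 0)
        while (*flag == 0)         // block 1 waits for block 0
            ;                      // may never terminate
}

Could Kernel2 be failing in some similar implicit way? I'll keep digging, but any pointers are welcome.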