Hi,
I have a ‘for’ loop that launches a kernel in each iteration. The number of threads I need in each iteration equals the iteration number.
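for (i = 1; i <= sizeA; i++)
{
    numThreadsPerBlock = 16;
    // Round up so that numBlocks * numThreadsPerBlock >= i
    int numBlocks = i / numThreadsPerBlock + (i % numThreadsPerBlock == 0 ? 0 : 1);
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    Kernel<<<dimGrid, dimBlock>>>(d_A, sizeA, d_B, sizeB, d_Matrix, d_Sub, i, 1);
    // cudaThreadSynchronize() is deprecated; cudaDeviceSynchronize() replaces it
    cudaDeviceSynchronize();
}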
If I do it this way, then in iteration 20, for example, I need only 20 threads, but the kernel is launched with 2 blocks, i.e. 32 threads. The extra 12 threads break my code, which is written to run with exactly the number of threads I need (20 in this case), so they cause my kernel to crash.
So, to fix this in my kernel code, I allow only the threads with index < 20 to do any work, which stops the other threads from executing. But somehow my kernel still crashes.
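For reference, here is a minimal sketch of the guard described above, not the actual kernel from this post. The names guardedKernel, blocksFor and d_out are hypothetical, and the kernel body is a placeholder. It assumes the iteration number is passed to the kernel as the thread count, and it checks the error code after every launch so that a failing iteration is reported instead of going unnoticed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: only the first n threads do any work.
__global__ void guardedKernel(int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx >= n) return;  // surplus threads in the last block exit here
    out[idx] = idx;        // placeholder for the real per-thread work
}

// Ceiling division: number of blocks needed to cover n threads.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int sizeA = 64;            // assumed problem size for this sketch
    const int threadsPerBlock = 16;

    int *d_out = nullptr;
    if (cudaMalloc((void **)&d_out, sizeA * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    for (int i = 1; i <= sizeA; i++) {
        // The grid is rounded up to whole blocks; the guard trims it back to i.
        guardedKernel<<<blocksFor(i, threadsPerBlock), threadsPerBlock>>>(d_out, i);

        cudaError_t err = cudaGetLastError();                   // launch errors
        if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
        if (err != cudaSuccess) {
            printf("iteration %d failed: %s\n", i, cudaGetErrorString(err));
            break;
        }
    }

    cudaFree(d_out);
    return 0;
}
```

One thing to watch with this pattern: if the kernel calls __syncthreads() after the early return, the threads that already exited never reach the barrier. In that case it is usually safer to wrap the work in if (idx < n) { ... } and keep the barrier outside the conditional.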
Can anyone shed some light on the situation?
Thanks a lot …
Edit: Is there a way to launch exactly as many threads as the iteration number?