 Hi,

I have a ‘for’ loop in which I launch a kernel in each iteration. The number of threads I need in each iteration is the iteration number itself.

``````
for (i=1; i<=sizeA; i++)
{
    int numBlocks = i / numThreadsPerBlock + (i % numThreadsPerBlock==0 ? 0:1);
    dim3 dimGrid(numBlocks);
    Kernel<<<dimGrid,dimBlock>>>(d_A, sizeA, d_B, sizeB, d_Matrix,d_Sub,i,1);
}
``````

If I do it this way, then in iteration 20, say, I need only 20 threads, but the kernel is launched with 2 blocks, i.e. 32 threads. The 12 extra threads break my code, because it is written to run with exactly the number of threads I need (20 in this case), so they cause my kernel to crash.

So, to fix this in my kernel code, I allow only those threads with index < 20 to run, which stops the other threads from executing. But somehow my kernel is still crashing.

Can anyone shed some light on the situation? Thanks a lot.

Edit: Is there a way to launch exactly as many threads as the iteration number?

I assume your thread count grows much larger than the 20 in your example. I’ll also ignore the inefficient choice of 16 threads per block, since this is about how to set up the launch, not optimization. The block count can be computed more simply with a rounding-up integer division:

``````
int numBlocks = (i + numThreadsPerBlock - 1) / numThreadsPerBlock;
``````

In general you can’t launch exactly the number of threads you want, since the total number of threads is the product of the number of blocks and the threads per block, so it is always a multiple of the block size. Again, I’m assuming you’re really going to large thread counts, not just 20.
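To make that arithmetic concrete, here is a rough host-side sketch in plain C++ (it should also compile unchanged as CUDA host code). The struct and function names are made up for illustration and are not from the original post; it only computes the launch shape and does not launch anything.

```cpp
#include <cassert>

// Hypothetical helper type, for illustration only.
struct LaunchShape {
    int numBlocks;       // blocks in the grid
    int totalThreads;    // numBlocks * threadsPerBlock: what actually gets launched
    int surplusThreads;  // launched threads with nothing to work on
};

// Rounding-up integer division: the smallest block count whose
// threads cover `needed` work items.
LaunchShape launchShapeFor(int needed, int threadsPerBlock) {
    assert(needed > 0 && threadsPerBlock > 0);
    int numBlocks = (needed + threadsPerBlock - 1) / threadsPerBlock;
    int total = numBlocks * threadsPerBlock;
    return LaunchShape{numBlocks, total, total - needed};
}
```

For the case in the question, `launchShapeFor(20, 16)` gives 2 blocks, 32 launched threads and 12 surplus threads. Those 12 are the ones the kernel has to skip.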

So you should indeed just choose your threads-per-block based on efficiency; often this will be 32 or 64. Then compute the number of blocks as above and, in your kernel, as you say, branch at the beginning on the thread ID, something like:

``````
int tid = threadIdx.x + blockDim.x * blockIdx.x;