Can anyone explain this ... ?

Hi,

Background: I have to launch a kernel sizeA times, and in the ith iteration I need only i threads.

In the following code, I tried two approaches. In the first, I launch i threads in the ith iteration (not exactly i, but the smallest number of blocks that covers i threads). In the second, I launch sizeA threads in every iteration (again, rounded up to whole blocks).

My timing revealed that performance is better in the second case, where I launch a constant number of threads in every iteration.

for (i = 1; i <= sizeA; i++)
{
    numThreadsPerBlock = 16;

    // number of threads launched depends on the iteration
    //int numBlocks = i/numThreadsPerBlock + (i % numThreadsPerBlock == 0 ? 0 : 1);

    // same number of threads launched in all iterations
    int numBlocks = sizeA/numThreadsPerBlock + (sizeA % numThreadsPerBlock == 0 ? 0 : 1);

    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);

    Kernel<<<dimGrid, dimBlock>>>(some parameters);

    cudaThreadSynchronize();
}

I’m unable to understand why the performance is better in the second case. There are a lot of unused threads in the second case. Shouldn’t the first case, where I launch only the required number of blocks, be more efficient?

Please let me know if I did not express myself clearly… Any input is appreciated… Thanks

I’m guessing that when the grid configuration stays the same, part of the launch setup gets reused across invocations. The unused warps terminate quickly and don’t eat up many GPU cycles, and you also avoid the per-iteration overhead of setting up a different grid each time (no mod, no branch, etc.).
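For what it’s worth, the pattern that makes the fixed-size launch safe is a bounds guard at the top of the kernel, so the surplus threads retire immediately. The kernel body and parameter names here are hypothetical placeholders (d_data, n), since the original kernel isn’t shown. A rough sketch:

```
// Hypothetical sketch: n is the number of useful threads this iteration
// (i in your loop); the padded launch may supply more threads than n.
__global__ void Kernel(float *d_data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n)
        return;          // surplus threads exit here; fully-idle warps cost almost nothing
    d_data[idx] *= 2.0f; // placeholder for the real per-thread work
}
```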

Thanks…