Calling CUDA kernels in a loop

Hi All,

Are there any reported bugs regarding calling kernels in a loop in the host code?

I need to call a kernel with a different grid size in each iteration. The grid size goes from 1 to 128.

Inside the loop, once the grid size grows beyond 16, I can't see more than 16 blocks running.

When I call the kernel outside a loop with a grid size of 128, everything is fine and I can see 128 blocks running.

I used cudaThreadSynchronize() after each kernel call.

Does anyone have any clue what is wrong?




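for (int b = 1; b <= n_block; b++)
{
    dim3 dimBlock(1, TB_SIZE);  // TB_SIZE threads per block, along y
    dim3 dimGrid(b, 1);         // b blocks in this iteration
    compute_block <<< dimGrid, dimBlock >>> (d_A, d_B, d_C);
    cudaThreadSynchronize();    // wait for the kernel before the next launch
}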



If you are using the profiler to find the number of blocks, note that the profiler output covers only one multiprocessor (MP). Different MPs can run different numbers of blocks, so there is no point in multiplying this number by the number of MPs on your hardware.
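One way to check the block count without relying on the profiler is to have every block increment a counter and read it back after each launch. The following is only a minimal sketch under assumptions: count_blocks is a hypothetical stand-in for compute_block, d_counter is a name introduced here, and TB_SIZE is set to 64 because the original value is not given. It uses cudaDeviceSynchronize(), the current replacement for the deprecated cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TB_SIZE 64  // assumed value; the original post does not give it

// Hypothetical stand-in for compute_block: one thread per block bumps a counter.
__global__ void count_blocks(unsigned int *counter)
{
    if (threadIdx.x == 0 && threadIdx.y == 0)
        atomicAdd(counter, 1u);
}

int main()
{
    const int n_block = 128;
    unsigned int *d_counter = NULL;
    if (cudaMalloc((void **)&d_counter, sizeof(unsigned int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    int mismatches = 0;
    for (int b = 1; b <= n_block; b++) {
        cudaMemset(d_counter, 0, sizeof(unsigned int));

        dim3 dimBlock(1, TB_SIZE);
        dim3 dimGrid(b, 1);
        count_blocks<<<dimGrid, dimBlock>>>(d_counter);

        // Check both the launch itself and the kernel's execution.
        cudaError_t err = cudaGetLastError();
        if (err == cudaSuccess)
            err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            printf("b = %d: %s\n", b, cudaGetErrorString(err));
            cudaFree(d_counter);
            return 1;
        }

        unsigned int h_counter = 0;
        cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
                   cudaMemcpyDeviceToHost);
        if (h_counter != (unsigned int)b) {
            printf("b = %d: only %u blocks ran\n", b, h_counter);
            mismatches++;
        }
    }

    printf("%d grid sizes with a wrong block count\n", mismatches);
    cudaFree(d_counter);
    return mismatches != 0;
}
```

If this reports no mismatches, every launch in the loop ran all of its blocks, and the limit of 16 is more likely an artifact of how the blocks are being observed than of the loop itself.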

dim3 dimGrid(b, 1): does this get constructed correctly on every iteration of the for loop? Interesting… I would rather use direct code that sets the x and y components, like this: dimGrid.x = b; dimGrid.y = 1;
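As a quick host-side check of the two forms, here is a small sketch that builds the grid both ways in each iteration and compares them. The variable names constructed and assigned are introduced here for illustration; no kernel is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n_block = 128;
    dim3 assigned(1, 1);  // declared once, updated member by member
    int differences = 0;

    for (int b = 1; b <= n_block; b++) {
        // Form used in the original loop: construct a new dim3 each time.
        dim3 constructed(b, 1);

        // Form suggested above: set the components directly.
        assigned.x = b;
        assigned.y = 1;

        // Unspecified dim3 components default to 1, so z is 1 in both.
        if (constructed.x != assigned.x ||
            constructed.y != assigned.y ||
            constructed.z != assigned.z)
            differences++;
    }

    printf("%d iterations where the two forms differ\n", differences);
    return differences != 0;
}
```

Both forms should produce the same grid dimensions, so switching between them is mainly a way to rule out the declaration as the cause rather than an expected fix.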