Are there any reported bugs regarding calling kernels in a loop from the host code?
I need to call a kernel with a different grid size in each iteration; the grid size goes from 1 to 128.
Inside the loop, once the grid size exceeds 16, I can't see more than 16 blocks running.
When I call the kernel once, outside the loop, with grid size 128, everything is fine and I can see all 128 blocks running.
I used cudaThreadSynchronize() after each kernel call.
Does anyone have any clue what is wrong?
for (int b = 1; b <= n_block; b++) {
    dim3 dimBlock(1, TB_SIZE);
    dim3 dimGrid(b, 1);
    compute_block<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    cudaThreadSynchronize();
}
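For reference, here is a minimal sketch of the same loop with error checking added after each launch (the `cudaGetLastError()` / `cudaGetErrorString()` calls are additions, not part of my actual code; `compute_block`, `TB_SIZE`, and the device pointers are as above):

```cuda
for (int b = 1; b <= n_block; b++) {
    dim3 dimBlock(1, TB_SIZE);
    dim3 dimGrid(b, 1);

    compute_block<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Catch launch-configuration errors (bad grid/block dims, etc.) right away
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed at b=%d: %s\n", b, cudaGetErrorString(err));

    // Wait for the kernel to finish before launching the next one
    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed at b=%d: %s\n", b, cudaGetErrorString(err));
}
```

If an earlier launch in the loop fails silently, the error would otherwise only surface at a later synchronization point, which can make it look like blocks stop running past a certain grid size.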