Hi All,
Are there any reported bugs regarding calling kernels in a loop from the host code?
I need to call a kernel with a different grid size in each iteration; the grid size goes from 1 up to 128.
Inside the loop, once the grid size exceeds 16, I never see more than 16 blocks running.
When I call the kernel once, outside the loop, with grid size 128, everything is fine and I can see all 128 blocks running.
I call cudaThreadSynchronize() after each kernel launch.
Does anyone have any clue what is wrong?
Thanks.
=======================================================
[codebox]
dim3 dimBlock(1, TB_SIZE);    // TB_SIZE threads per block; block shape never changes

for (int b = 1; b <= n_block; b++)
{
    dim3 dimGrid(b, 1);       // grid grows from 1 to n_block blocks
    compute_block<<< dimGrid, dimBlock >>>(d_A, d_B, d_C);
    CUDA_SAFE_CALL(cudaThreadSynchronize());
}
[/codebox]
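In case it helps to reproduce or narrow this down, here is a sketch of the extra error checking I could add inside the loop (assuming the same kernel and device arrays as above). cudaGetLastError() right after the launch catches a failed launch immediately, rather than relying on the synchronize call alone:

[codebox]
// Launch, then check for a launch failure before synchronizing.
compute_block<<< dimGrid, dimBlock >>>(d_A, d_B, d_C);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed at b = %d: %s\n", b, cudaGetErrorString(err));
CUDA_SAFE_CALL(cudaThreadSynchronize());
[/codebox]

If a launch with a larger grid is failing silently (for example from a resource limit), this should print the error on the first iteration where it happens.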