Question about nested for-loop, and how it works

Hi guys.

I am taking a CUDA course, and I don’t understand how nested for-loops work.
Suppose I have a kernel like this:

  for (int i = idx; i < N; i += stride)
  {
    a[i] *= 2;
    for (int j = 0; j < 2; j ++)
    {
      printf("H");
    }
  }

I think the outer for-loop is distributed across CUDA threads to be parallelized, but what happens to the inner for-loop, and how does the system know that the inner for-loop is different?

Thank you!

Each thread will execute that entire nested loop. Nothing in CUDA distributes a loop automatically; the work is divided only because each thread derives its own starting index and step from its thread indices. So

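for (int i = threadIdx.x; i < 32000; i += blockDim.x) {  // staggered start, step by block size
  for (int j = 0; j < 2; j++) {  // plain sequential loop, run in full by each thread
    printf("H");
  }
}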

would be an effective way to have the threads of one block share a list of 32,000 tasks (each task prints two “H” characters). Every thread still executes the entire nested loop structure, but each thread's counter i starts at a staggered value (its thread index) and advances by the block dimension (assuming the kernel is launched with only one block), so no two threads perform the same task. The inner loop is not treated specially: it is an ordinary sequential loop that each thread runs in full for every i it handles.
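To make this concrete, here is a minimal sketch of the kernel from the question, extended to more than one block. It assumes idx and stride are computed in the usual grid-stride way, which the question does not show. The printf is replaced by a counter so the result can be checked on the host. The names scale_and_count, inner_runs and run_demo are made up for this illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Every thread runs this whole body. The outer loop is split between threads
// only by the arithmetic on idx and stride, not by the compiler or runtime.
__global__ void scale_and_count(float *a, int *inner_runs, int n)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;  // assumed: this thread's first element
  int stride = gridDim.x * blockDim.x;              // assumed: total threads in the grid

  for (int i = idx; i < n; i += stride)
  {
    a[i] *= 2;
    // Ordinary sequential loop: the thread that owns i runs both iterations.
    for (int j = 0; j < 2; j++)
    {
      inner_runs[i] += 1;  // no race, since exactly one thread handles each i
    }
  }
}

// Hypothetical host helper: returns true if every element was doubled exactly
// once and saw exactly two inner-loop iterations.
bool run_demo(int n, int blocks, int threads)
{
  std::vector<float> h_a(n, 1.0f);
  std::vector<int> h_runs(n, 0);
  float *d_a = nullptr;
  int *d_runs = nullptr;

  if (cudaMalloc(&d_a, n * sizeof(float)) != cudaSuccess) return false;
  if (cudaMalloc(&d_runs, n * sizeof(int)) != cudaSuccess) { cudaFree(d_a); return false; }
  cudaMemcpy(d_a, h_a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemset(d_runs, 0, n * sizeof(int));

  scale_and_count<<<blocks, threads>>>(d_a, d_runs, n);
  cudaError_t err = cudaDeviceSynchronize();

  cudaMemcpy(h_a.data(), d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(h_runs.data(), d_runs, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_a);
  cudaFree(d_runs);
  if (err != cudaSuccess) return false;

  for (int i = 0; i < n; i++)
  {
    if (h_a[i] != 2.0f || h_runs[i] != 2) return false;
  }
  return true;
}
```

Calling run_demo(1000, 4, 64) from main launches 256 threads over 1000 elements, so each thread handles three or four values of i and runs the inner loop twice for each of them.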

Alternatively, you could use the grid system. Since a grid can have multiple blocks in each dimension, you could map the outer loop index to the “x” dimension and the inner loop index to the “y” dimension, so that each (i, j) pair gets its own thread. You can read about it with a quick Google search for CUDA grids or CUDA nested loops.
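A rough sketch of that 2D mapping might look like the following. It only works when the (i, j) iterations are independent of each other, and for an inner loop of just two iterations it is probably not worth the trouble. The names nested_2d and run_nested_2d, the 16 x 2 block shape, and the value written to out are all assumptions made for the example.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread per (i, j) pair: x covers the outer loop, y covers the inner loop.
__global__ void nested_2d(int *out, int n_outer, int n_inner)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // outer loop index
  int j = blockIdx.y * blockDim.y + threadIdx.y;  // inner loop index

  // The grid may be larger than the loop bounds, so guard both indices.
  if (i < n_outer && j < n_inner)
  {
    out[i * n_inner + j] = i * 10 + j;  // stand-in for the real loop body
  }
}

// Hypothetical host helper: launches the kernel and checks every (i, j) slot.
bool run_nested_2d(int n_outer, int n_inner)
{
  int total = n_outer * n_inner;
  std::vector<int> h_out(total, -1);
  int *d_out = nullptr;
  if (cudaMalloc(&d_out, total * sizeof(int)) != cudaSuccess) return false;

  dim3 threads(16, 2);  // assumed block shape: 32 threads per block
  dim3 blocks((n_outer + threads.x - 1) / threads.x,
              (n_inner + threads.y - 1) / threads.y);  // round up in both dimensions

  nested_2d<<<blocks, threads>>>(d_out, n_outer, n_inner);
  cudaError_t err = cudaDeviceSynchronize();
  cudaMemcpy(h_out.data(), d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  if (err != cudaSuccess) return false;

  for (int i = 0; i < n_outer; i++)
    for (int j = 0; j < n_inner; j++)
      if (h_out[i * n_inner + j] != i * 10 + j) return false;
  return true;
}
```

Here neither loop appears in the kernel at all: the launch configuration plays the role of both loop headers, and the bounds check replaces the loop conditions.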