how gang and vector parallelization of a loop map to the GPU

Hi,

Consider a single parallel loop which contains no nested loops and where n is sufficiently big to fill the entire GPU with work

!$acc kernels do
do i = 1,n
  calculations
end do

so that the compiler output yields something like
!$acc loop gang, vector(128)

So what does this mean exactly? Will this ensure that each core on the GPU has a loop iteration to work on? Is it possible that only one multiprocessor is being used, or that each multiprocessor isn’t being entirely filled with work? Why doesn’t it tell me how many gangs are being used?

My guess is that the n iterations are divided amongst the different gangs, which each correspond to a thread block. The number of thread blocks, or gangs we get, depends on the number of multi-processors our GPU has. The vector(128) specifies that 128 threads are in each thread block.

So, upon execution, each multiprocessor executes 128 threads in parallel, where each thread corresponds to an iteration of the loop. How accurate is this?

So what does this mean exactly?

The best way to think of this is that the compiler is strip-mining this loop (i.e. adding an inner loop which computes 128 elements). The inner stip-mined loop then mapped to the thread block where each thread computes one iteration. The outer loop would then be mapped to the grid block with one inner loop (i.e. groups of 128).

Will this ensure that each core on the GPU has a loop iteration to work on?

Depends upon the value of N and how many cores your GPU has. While not always the case, in this example each iteration of the loop gets mapped to a thread. If the number of iterations is less then the number of the core, then you will have idle cores. If you have more iterations than cores, some gangs will need to wait for others to complete before computing.

Is it possible that only one multiprocessor is being used, or that each multiprocessor isn’t being entirely filled with work?

Sure, if N is less than the number of cores.

Why doesn’t it tell me how many gangs are being used?

It’s variable at runtime given the size of N. Set the environment variable “PGI_ACC_TIME=1” to see the profile, including the grid dimensions.

This article is a bit dated but still gives a great explanation of the CUDA threading model. Account Login | PGI

  • Mat

Dear Mat,
one further question concerning the schedule that is taken for a given number of gangs. Lets say we have:

int n = 1024;
int blocks = 8;
int threadsperblock = 32;

#pragma acc parallel num_gangs(block) vector_length(threadsperblock)
#pragma acc loop gang vector
for (int i=0; i<n; i++) // do something

I assume that internally happens the code below. Given a fixed gang size, is it correct that we have some kind of “static” schedule among the gangs. What I mean by this is the distribution methodology from OpenMP’s static for schedule: Put the same size of iterations into one block and put close iterations together in one block. If so, I believe the schedule would look like the one below (with two nested loops for the gangs, and one strip-mined loop for the threads within the gang).

int n = 1024;
int blocks = 8;
int threadsperblock = 32;

#pragma acc parallel num_gangs(block) vector_length(threadsperblock)
#pragma acc loop gang
for (int i=0; i<blocks; i++) { // 0..8 - number of blocks
 for (int j=i*(n/blocks); j<(i+1)*(n/blocks); j+=32) { // 0..127, 128..255,.. - number of iterations within one block that is strip-mined by threadsperblock
#pragma acc loop vector
  for (int k=j; k<j+threadsperblock; k++) { //0..32 - threadsperblock
    // do something
}}}

Thanks, Sandra

Hi Sandra,

While the exact kernel may vary depending on the host code, it would be conceptionally similar to what you have. Though, the kernel would stride by the gang*vector length (256). Hence block 0 would compute elements 0-31, 256-287, 512-543, 768-799

Something like:

#pragma acc parallel num_gangs(8) vector_length(32)
{
#pragma acc loop gang vector
   for (int i=0; i < N; ++i) {
      val[i] = i;
   }
}

Would get turned into something like:

// outer gang loop
      for (int i = 0; i < 8; ++i) {
// stripmine - stride by num gangs * vector length
       for (int j = i*32; j < 1024; j+=(32*8)) {
// vector loop
        for (int k=j; k<j+32;++k) {
            val2[k] = k;
        }
   }
}

Hi Mat,
Thanks for your answer. Just to make sure: The outer 32 is taken from n/threadsperblock = 1024 / 32 = 32, correct?
Sandra

Just to make sure: The outer 32 is taken from n/threadsperblock = 1024 / 32 = 32, correct?

No, that’s a typo. It should be the gang size, i.e. “8”.

Sorry!
Mat

I just fixed the original post…