how gang and vector parallelization of a loop map to the GPU

brushman · October 14, 2013, 1:04am

Hi,

Consider a single parallel loop which contains no nested loops and where n is sufficiently big to fill the entire GPU with work

!$acc kernels do
do i = 1,n
  calculations
end do

so that the compiler output yields something like
!$acc loop gang, vector(128)

So what does this mean exactly? Will this ensure that each core on the GPU has a loop iteration to work on? Is it possible that only one multiprocessor is being used, or that each multiprocessor isn’t being entirely filled with work? Why doesn’t it tell me how many gangs are being used?

My guess is that the n iterations are divided amongst the different gangs, which each correspond to a thread block. The number of thread blocks, or gangs we get, depends on the number of multi-processors our GPU has. The vector(128) specifies that 128 threads are in each thread block.

So, upon execution, each multiprocessor executes 128 threads in parallel, where each thread corresponds to an iteration of the loop. How accurate is this?

MatColgrove · October 14, 2013, 2:25pm

So what does this mean exactly?

The best way to think of this is that the compiler is strip-mining this loop (i.e. adding an inner loop which computes 128 elements). The inner stip-mined loop then mapped to the thread block where each thread computes one iteration. The outer loop would then be mapped to the grid block with one inner loop (i.e. groups of 128).

Will this ensure that each core on the GPU has a loop iteration to work on?

Depends upon the value of N and how many cores your GPU has. While not always the case, in this example each iteration of the loop gets mapped to a thread. If the number of iterations is less then the number of the core, then you will have idle cores. If you have more iterations than cores, some gangs will need to wait for others to complete before computing.

Is it possible that only one multiprocessor is being used, or that each multiprocessor isn’t being entirely filled with work?

Sure, if N is less than the number of cores.

Why doesn’t it tell me how many gangs are being used?

It’s variable at runtime given the size of N. Set the environment variable “PGI_ACC_TIME=1” to see the profile, including the grid dimensions.

This article is a bit dated but still gives a great explanation of the CUDA threading model. Account Login | PGI

Mat

sWienke · February 25, 2014, 9:28am

Dear Mat,
one further question concerning the schedule that is taken for a given number of gangs. Lets say we have:

int n = 1024;
int blocks = 8;
int threadsperblock = 32;

#pragma acc parallel num_gangs(block) vector_length(threadsperblock)
#pragma acc loop gang vector
for (int i=0; i<n; i++) // do something

I assume that internally happens the code below. Given a fixed gang size, is it correct that we have some kind of “static” schedule among the gangs. What I mean by this is the distribution methodology from OpenMP’s static for schedule: Put the same size of iterations into one block and put close iterations together in one block. If so, I believe the schedule would look like the one below (with two nested loops for the gangs, and one strip-mined loop for the threads within the gang).

int n = 1024;
int blocks = 8;
int threadsperblock = 32;

#pragma acc parallel num_gangs(block) vector_length(threadsperblock)
#pragma acc loop gang
for (int i=0; i<blocks; i++) { // 0..8 - number of blocks
 for (int j=i*(n/blocks); j<(i+1)*(n/blocks); j+=32) { // 0..127, 128..255,.. - number of iterations within one block that is strip-mined by threadsperblock
#pragma acc loop vector
  for (int k=j; k<j+threadsperblock; k++) { //0..32 - threadsperblock
    // do something
}}}

Thanks, Sandra

MatColgrove · February 25, 2014, 11:39pm

Hi Sandra,

While the exact kernel may vary depending on the host code, it would be conceptionally similar to what you have. Though, the kernel would stride by the gang*vector length (256). Hence block 0 would compute elements 0-31, 256-287, 512-543, 768-799

Something like:

#pragma acc parallel num_gangs(8) vector_length(32)
{
#pragma acc loop gang vector
   for (int i=0; i < N; ++i) {
      val[i] = i;
   }
}

Would get turned into something like:

// outer gang loop
      for (int i = 0; i < 8; ++i) {
// stripmine - stride by num gangs * vector length
       for (int j = i*32; j < 1024; j+=(32*8)) {
// vector loop
        for (int k=j; k<j+32;++k) {
            val2[k] = k;
        }
   }
}

sWienke · February 26, 2014, 7:39am

Hi Mat,
Thanks for your answer. Just to make sure: The outer 32 is taken from n/threadsperblock = 1024 / 32 = 32, correct?
Sandra

MatColgrove · February 26, 2014, 11:52pm

Just to make sure: The outer 32 is taken from n/threadsperblock = 1024 / 32 = 32, correct?

No, that’s a typo. It should be the gang size, i.e. “8”.

Sorry!
Mat

I just fixed the original post…

Topic		Replies	Views
OpenACC parallel loop gang, vector Legacy PGI Compilers	4	6444	December 7, 2023
Computing multiple elements per thread in OpenACC Legacy PGI Compilers	3	2433	May 17, 2013
Questions about 'vector' and 'gang' Legacy PGI Compilers	5	7013	February 10, 2016
OpenACC: Fine tuning accelerator performance nvc, nvc++ and nvfortran	5	1229	March 18, 2021
"!$acc routine vector" leads to 32 threads per gan Legacy PGI Compilers	4	2295	June 5, 2023
Mapping between OpenACC and CUDA parallelism levels Legacy PGI Compilers	3	6544	April 16, 2015
Help understanding gang and vector specification Legacy PGI Compilers	1	2389	November 26, 2012
gang and worker Legacy PGI Compilers	3	2323	May 7, 2013
Question about compiler optimization in openmp offloading nvc, nvc++ and nvfortran	6	1139	November 29, 2021
paralle + independent and kernels + vector_length() Legacy PGI Compilers	5	4029	August 20, 2012

how gang and vector parallelization of a loop map to the GPU

Related topics