Computing multiple elements per thread in OpenACC

sWienke · February 20, 2013, 2:27pm

Hi,
assume we have the following code:

#pragma acc kernels
#pragma acc loop gang(16) vector(32)
for (int i=0; i<2048; i++) {
  // do something with array[i]
}

With PGI Compiler 12.9, this created a grid of size 16 with blocks of size 32, so that each CUDA thread executed 4 elements (2048 / (16 * 32) = 4).
However, with PGI Compiler 13.1 this is no longer possible. If I specify both the vector and the gang size, the gang size is ignored during execution (although the compiler feedback still reports that it uses 16 gangs). With 13.1, the loop is automatically executed with a grid size of 64 (and a vector size of 32).
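
To make the arithmetic behind these two launch configurations explicit, here is a minimal sketch in plain C. It is not what the PGI compiler generates; the helper names `elements_per_thread` and `gangs_for_one_element_each` are hypothetical, and an even, rounded-up distribution of iterations over threads is assumed.

```c
#include <assert.h>

/* Hypothetical helper: iterations each thread handles when `iterations`
   loop trips are spread evenly (rounded up) over gangs * vector_length
   threads. Assumes an even distribution, which may differ from the
   compiler's actual schedule. */
int elements_per_thread(int iterations, int gangs, int vector_length)
{
    assert(iterations >= 0 && gangs > 0 && vector_length > 0);
    int threads = gangs * vector_length;
    return (iterations + threads - 1) / threads;
}

/* Hypothetical helper: number of gangs (grid size) needed when every
   thread handles exactly one iteration. */
int gangs_for_one_element_each(int iterations, int vector_length)
{
    assert(iterations >= 0 && vector_length > 0);
    return (iterations + vector_length - 1) / vector_length;
}
```

Under these assumptions, `elements_per_thread(2048, 16, 32)` gives the 4 elements per thread seen with 12.9, and `gangs_for_one_element_each(2048, 32)` gives the grid size of 64 seen with 13.1.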
Is this a bug or intended? If the latter, why?
Kind regards, Sandra

MatColgrove · February 20, 2013, 11:34pm

Hi Sandra,

No, this doesn’t look correct. I’ve opened up a problem report (TPR#19149) and sent it to our engineers for further investigation.

Thanks!
Mat

sWienke · February 21, 2013, 9:49am

Thanks.
Just one addition: if I use a gang schedule for the outer loop and a vector schedule for the inner loop of a loop nest, and specify both sizes, the specified gang size is also ignored:

#pragma acc parallel vector_length(64) num_gangs(128)
#pragma acc loop gang
        for( int j = 0; j < n; j++)
        {
#pragma acc loop vector
            for( int i = 0; i < m; i++ ) {..}
        }

The output of ACC_NOTIFY shows block=64 as requested, but grid=8190 (which is n in my case) instead of the requested 128.
Sandra
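
For anyone who wants to reproduce this, the following is a self-contained sketch of the same loop nest. The function name `scale_rows`, the loop body (doubling each element), the row-major layout, and the `copy` data clause are assumptions made for illustration; the original body is elided above. With the requested 128 gangs and n = 8190, each gang would be expected to work through roughly 64 rows rather than one row per gang.

```c
/* Hypothetical reproducer: doubles every element of an n x m row-major
   array. The clauses match the loop nest above; the body and the copy
   clause are assumptions. Without an OpenACC compiler the pragmas are
   ignored and the loops run serially. */
void scale_rows(float *a, int n, int m)
{
#pragma acc parallel vector_length(64) num_gangs(128) copy(a[0:n*m])
    {
#pragma acc loop gang
        for (int j = 0; j < n; j++) {
#pragma acc loop vector
            for (int i = 0; i < m; i++) {
                a[j * m + i] *= 2.0f;
            }
        }
    }
}
```

Running a program that calls this with the notification output enabled, as described above, should show whether the grid size follows `num_gangs(128)` or the trip count of the outer loop.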

tull · May 17, 2013, 11:48pm

Sandra,

TPR 19149 has been fixed in the current 13.5 release.

dave
