Computing multiple elements per thread in OpenACC

Assume we have the following code:

#pragma acc kernels
#pragma acc loop gang(16) vector(32)
for (int i = 0; i < 2048; i++) {
  // do something with array[i]
}
With the PGI 12.9 compiler, this created a grid of 16 gangs with blocks of 32 threads, so each CUDA thread executed 4 elements.
However, with PGI 13.1 this is no longer possible. If I specify both the vector and the gang size, the gang size is ignored at execution time (even though the compiler feedback reports that it uses 16 gangs). With 13.1, the compiler instead launches the loop with a grid size of 64 (and a vector size of 32).
Is this a bug or intended? If the latter, why?
Kind regards, Sandra

Hi Sandra,

No, this doesn’t look correct. I’ve opened up a problem report (TPR#19149) and sent it to our engineers for further investigation.


Just one addition: if I apply a gang schedule to the outer loop and a vector schedule to the inner loop of a loop nest, and specify both sizes, the specified gang size is also ignored:

#pragma acc parallel vector_length(64) num_gangs(128)
#pragma acc loop gang
for (int j = 0; j < n; j++)
#pragma acc loop vector
    for (int i = 0; i < m; i++) { ... }

The output of ACC_NOTIFY shows that block=64, but grid=8190 (which is n in my case).


TPR 19149 has been fixed in the current 13.5 release.