#pragma acc kernels
#pragma acc loop gang(16) vector(32)
for (int i=0; i<2048; i++) {
// do something with array[i]
}
With PGI Compiler 12.9, this created a grid of 16 blocks with 32 threads each, so that each CUDA thread executed 4 loop iterations.
However, with PGI Compiler 13.1 this is no longer possible. If I specify both the vector and the gang size, the gang size is ignored during execution (although the compiler feedback still tells me that it uses 16 gangs). With 13.1, the compiler automatically executes the loop with a grid size of 64 (and a vector size of 32).
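To make the arithmetic explicit, here is a small sketch in plain C of how these numbers relate. The helper names are made up for illustration, and the even, rounded-up split of iterations across threads is an assumption about the mapping, not something taken from the compiler documentation.

```c
#include <assert.h>

/* Hypothetical helper: number of loop iterations each thread handles when
   n iterations are spread evenly over gangs * vector threads (rounded up). */
int iterations_per_thread(int n, int gangs, int vector)
{
    assert(n >= 0 && gangs > 0 && vector > 0);
    int threads = gangs * vector;
    return (n + threads - 1) / threads;
}

/* Hypothetical helper: number of gangs needed so that every thread
   handles at most one iteration for a given vector length. */
int gangs_for_one_iteration_each(int n, int vector)
{
    assert(n >= 0 && vector > 0);
    return (n + vector - 1) / vector;
}
```

For the loop above, iterations_per_thread(2048, 16, 32) gives 4, which is the 12.9 behaviour, while gangs_for_one_iteration_each(2048, 32) gives 64, which matches the grid size I see with 13.1.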
Is this a bug or intended? If the latter, why?
Kind regards, Sandra
Thanks.
Just one addition: if I use a gang schedule for the outer loop and a vector schedule for the inner loop of a loop nest, and specify both sizes, then the specified number of gangs is ignored as well:
#pragma acc parallel vector_length(64) num_gangs(128)
#pragma acc loop gang
for( int j = 0; j < n; j++)
{
#pragma acc loop vector
for( int i = 0; i < m; i++ ) {..}
}
The output of ACC_NOTIFY shows block=64 as requested, but grid=8190 (which is n in my case) instead of the requested 128.
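For anyone who wants to try this, a self-contained version of the loop nest could look roughly like the sketch below. The function name, the array a, the copy clause and the loop body are placeholders I chose for illustration and are not my actual code; only the directives on the parallel region and the two loops are the same as above.

```c
#include <assert.h>

/* Placeholder reproducer: fills an n x m array stored row by row.
   The array and the loop body are illustrative only. */
void fill_grid(float *a, int n, int m)
{
    assert(a && n >= 0 && m >= 0);
    #pragma acc parallel vector_length(64) num_gangs(128) copy(a[0:n*m])
    {
        /* Outer loop is distributed across gangs (CUDA blocks). */
        #pragma acc loop gang
        for (int j = 0; j < n; j++) {
            /* Inner loop is distributed across vector lanes (CUDA threads). */
            #pragma acc loop vector
            for (int i = 0; i < m; i++) {
                a[j * m + i] = (float)(j + i);
            }
        }
    }
}
```

Without an OpenACC compiler the pragmas are simply ignored and the function runs sequentially, so the results can be compared against the accelerated run.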
Sandra