I have the following code that sums over several worker-local arrays:
#include <iostream>
#include <stdlib.h>
#include <stdio.h>   // for printf
int main() {
    int tot = 0;
    #pragma acc parallel loop gang worker vector_length(32) reduction(+:tot)
    for( int i = 0; i < 100; i++ ){
        int arr[32];
        #pragma acc data create(arr[:32])
        #pragma acc loop vector
        for( int j = 0; j < 32; j++ ){ arr[j] = i + j; }
        int s = 0;
        #pragma acc loop vector reduction(+:s)
        for( int j = 0; j < 32; j++ ){ s += arr[j]; }
        tot += s;
    }
    printf( "%d\n", tot );
    return 0;
}
It does not produce the correct result unless I remove either the worker clause or the vector_length(32) clause. I can’t make sense of this, since these clauses should only change how the loop iterations are distributed across resources. Somehow, some of the loop iterations seem to be duplicated.
If I’m correct, arr is private to each worker, so no atomics should be needed. And the result, although wrong, is reproducible.
It’s a known issue (TPR#31948). Early in the compilation phase (before OpenACC is applied), local array declarations get hoisted out of the loop. If the loop used only “gang”, the compiler could automatically privatize the array. However, it doesn’t do so when multiple levels of parallelism are used (“gang worker” in this case), so the array is not private to each worker.
The workaround is to manually hoist the array and then explicitly add it to a “private” clause.
I added your example to the previous report.
% cat test.cpp
#include <iostream>
#include <stdlib.h>
#include <stdio.h>   // for printf
int main() {
    int tot = 0;
    int arr[32];   // hoisted out of the loop by hand
    #pragma acc parallel loop gang worker vector_length(32) reduction(+:tot) private(arr)
    for( int i = 0; i < 100; i++ ){
        // #pragma acc data create(arr[:32])  << removed, as data regions can only be used from the host
        #pragma acc loop vector
        for( int j = 0; j < 32; j++ ){ arr[j] = i + j; }
        int s = 0;
        #pragma acc loop vector reduction(+:s)
        for( int j = 0; j < 32; j++ ){ s += arr[j]; }
        tot += s;
    }
    printf( "%d\n", tot );
    return 0;
}
% nvc++ -Minfo=accel test.cpp
% a.out
208000
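For anyone who wants to confirm that 208000 is the expected total without involving the accelerator, here is a minimal host-only sketch. It is not part of the original test case: the function names serial_total, closed_form_total and check_against_posted_output are made up for illustration. It recomputes the same double sum serially and compares it with the closed form 32*(0+1+...+99) + 100*(0+1+...+31) = 32*4950 + 100*496.

```cpp
#include <cassert>

// Hypothetical helper (not from the original post): serial version of the
// same computation, summing (i + j) for i in [0, outer) and j in [0, inner).
constexpr int serial_total(int outer, int inner) {
    int tot = 0;
    for (int i = 0; i < outer; i++) {
        int s = 0;
        for (int j = 0; j < inner; j++) { s += i + j; }
        tot += s;
    }
    return tot;
}

// Hypothetical helper: the same sum in closed form,
// inner * sum(i) + outer * sum(j), with sum(k) = n*(n-1)/2.
constexpr int closed_form_total(int outer, int inner) {
    return inner * (outer * (outer - 1) / 2)
         + outer * (inner * (inner - 1) / 2);
}

// Compile-time check for the sizes used in the thread (100 x 32).
static_assert(serial_total(100, 32) == closed_form_total(100, 32),
              "serial sum and closed form should agree");

// Runtime check against the value printed by the workaround build.
inline void check_against_posted_output() {
    assert(serial_total(100, 32) == 208000);
}
```

Calling check_against_posted_output() from a small main, built with any C++14 or later host compiler, should pass silently. A GPU build that prints a different number for the same loop bounds is then producing a wrong result rather than a differently rounded one, since everything here is integer arithmetic.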