Hello,

I got an unexpected result when using OpenACC directives for a parallel calculation.

Here is the test code below:

```
#include <iostream>
#include<openacc.h>
using namespace std;

int main(){
    int counter=1000;
    int counterb=5;
    int j=0;
    int sum=0;
#pragma acc enter data copyin(j)
#pragma acc parallel present(sum)
    {
#pragma acc loop
        for(int i=0;i<counter-1;i++){ //in final step i should be 999
#pragma acc loop
            for(int j=0;j<counterb-1;j++){ //in final step j should be 4
                sum=i+j; //in final step sum =4+ 999=1003
            }
        }
    }
#pragma acc exit data copyout(j,sum)
    cout<<"sum is :"<<sum<<endl;
    cout<<"j is :"<<j<<endl;
    return 0;
}
```

The result shows:

main:

12, Generating enter data copyin(j)

Generating present(sum)

Generating NVIDIA GPU code

14, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

17, #pragma acc loop seq

19, Accelerator restriction: induction variable live-out from loop: sum

34, Generating exit data copyout(sum,j)

xxxxxx$ ./test2

sum is :1001

j is :0

Furthermore, changing #pragma acc parallel present(sum) to #pragma acc parallel does not change the result.

How can I get the value of j from the final step inside the #pragma region? Why is sum equal to 1001 instead of 1003, and why is j equal to 0 instead of 4?

Thanks very much!

Hi glaciya2018,

When run in parallel, the order in which the loop iterations are executed is non-deterministic. To reliably get the value of sum from the last iteration, all previous iterations need to be executed in order (sequentially). When run in parallel, sum ends up holding the value written by whichever thread happened to update it last.
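As a point of reference, here is a minimal serial sketch of the same loop nest (the function name is made up for illustration). Because the loop conditions are i<counter-1 and j<counterb-1, the last iteration is i=998, j=3, so an in-order run leaves 1001 in sum rather than 1003.

```cpp
#include <utility>

// Serial reference for the original loop nest (hypothetical helper name).
// Iterations run in order, so the value left in sum is the one written
// by the final (i, j) pair.
// Returns {last value of sum, last j seen inside the loop body}.
std::pair<int, int> serial_last_iteration(int counter, int counterb) {
    int sum = 0;
    int last_j = 0;
    for (int i = 0; i < counter - 1; i++) {        // i runs 0 .. counter-2
        for (int j = 0; j < counterb - 1; j++) {   // j runs 0 .. counterb-2
            sum = i + j;   // overwritten on every iteration
            last_j = j;    // remember the inner index for inspection
        }
    }
    return {sum, last_j};
}
```

Calling serial_last_iteration(1000, 5) gives 1001 for sum and 3 for the last j, which is the baseline any parallel version would need to reproduce.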

For “j”, you have two separate variables here: the one scoped in main, and a second one scoped to the inner loop. The “j” in main is never modified, so it is still 0 when it is printed. Even if you hadn’t redeclared it on the inner loop, “j” is a loop index variable, so it must be private to each thread. Otherwise the code would have a race condition in which one thread sets its value and then a different thread overwrites it with another value.
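The scoping part can be seen without any OpenACC at all. A small sketch in plain C++, with a function name chosen only for this example:

```cpp
#include <utility>

// Shadowing illustration (hypothetical helper name).
// Returns {value of the outer j after the loop, number of inner iterations}.
std::pair<int, int> outer_j_after_inner_loop(int counterb) {
    int j = 0;            // outer j, like the one declared in main
    int iterations = 0;
    for (int j = 0; j < counterb - 1; j++) {  // a new j, scoped to this loop only
        iterations += 1;  // the body only ever sees the inner j
    }
    return {j, iterations};  // this j is the outer one, untouched by the loop
}
```

With counterb=5 the loop body runs 4 times, yet the outer j comes back as 0.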

If you change your code so it does a sum of all the index values, then you can parallelize the code using a reduction clause. For example:

```
% cat test.cpp
#include <iostream>
#include<openacc.h>
using namespace std;
int main(){
int counter=1000;
int counterb=5;
int sum=0;
#pragma acc parallel loop collapse(2) reduction(+:sum)
for(int i=0;i<counter-1;i++){
for(int j=0;j<counterb-1;j++){
sum+=i+j;
}
}
cout<<"sum is :"<<sum<<endl;
return 0;
}
// first run serially to get the expected value
% nvc++ test.cpp ; a.out
sum is :1999998
// now run in parallel
% nvc++ test.cpp -acc -Minfo=accel; a.out
main:
8, Generating NVIDIA GPU code
10, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
Generating reduction(+:sum)
11, /* blockIdx.x threadIdx.x collapsed */
8, Generating implicit copy(sum) [if not already present]
sum is :1999998
```
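If what you actually need is the value from the final iteration rather than a total, one possible workaround is a max reduction. This is only a sketch under a strong assumption: it works here because both i+j and j happen to be largest in the last iteration, which is not true for loop bodies in general. The function name is hypothetical, and the starting values of 0 assume the results are non-negative.

```cpp
#include <utility>

// Sketch: recover the "last iteration" values with max reductions.
// Assumption: i + j and j reach their maximum in the final iteration,
// which holds for these loops but not for arbitrary loop bodies.
// Returns {largest i + j, largest j}.
std::pair<int, int> last_values_by_max(int counter, int counterb) {
    int last_sum = 0;
    int last_j = 0;
    // Each thread keeps private copies; they are combined with max at the end.
    #pragma acc parallel loop collapse(2) reduction(max:last_sum,last_j)
    for (int i = 0; i < counter - 1; i++) {
        for (int j = 0; j < counterb - 1; j++) {
            if (i + j > last_sum) last_sum = i + j;
            if (j > last_j) last_j = j;
        }
    }
    return {last_sum, last_j};
}
```

Built without -acc the pragma is ignored and the loops run serially, so the result can be checked first: last_values_by_max(1000, 5) returns 1001 and 3. The same values are expected with -acc, since max does not depend on iteration order.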

Hope this helps,

Mat

Hello Mat,

Thanks very much. It helps a lot.