Hi glaciya2018,
In your first post, as I've noted in my previous answers, the i and j declared in main are different variables from the i and j declared in the for loops. Though it appears you may have figured that out, given you removed the inner loop declarations in the second post.
However, there are still several issues with your second post.
First, there is no "#pragma kernels" directive. This should be "#pragma acc kernels".
Second, the data region uses "copyin", so the variables aren't copied back from the device. The second data region is nested within the first and is therefore extraneous: besides having no scope of its own, the variables are already present on the device, and given OpenACC's "present_or" semantics, no new copy will be made. To synchronize memory within a data region, use the "update" directive:
#pragma acc update self(a[:N],j,i)
Though in this case it's simpler to change "copyin" to "copy" so the data is copied back at the end of the data region.
Last, why are you globalizing the loop index variables? As I've noted before, this prevents the code from being parallelized. In order to parallelize the loops, the index variables must be private; otherwise every iteration shares the same i and j, which is a race condition.
Here's the corrected code. I left in the global index variables, but as you can see from the compiler feedback messages, the code will be run sequentially on the device.
% cat test.cpp
#include<iostream>
#include<openacc.h>
#define N 5
using namespace std;
int main(){
int a[N];
int i,j=0;
#pragma acc data copy(a[:N],i,j) // cannot use present(i,j): the device would not find the data
{
#pragma acc kernels present(a[:N],j,i)
{
for(i=0;i<N;i++){
for(j=0;j<N;j++){
a[i]=i+j;
}
}
}
}
cout<<"a[N]"<<a[N-1]<<endl;
cout<<i<<endl;
cout<<j<<endl;
}
% nvc++ -acc -Minfo=accel test.cpp; a.out
main:
10, Generating copy(a[:],j,i) [if not already present]
12, Generating present(j,i,a[:])
Accelerator serial kernel generated
Generating NVIDIA GPU code
13, #pragma acc loop seq
14, #pragma acc loop seq
13, Loop carried scalar dependence for i at line 15,17
Parallelization would require privatization of array a[i]
Conditional loop will be executed in scalar mode
Loop carried scalar dependence for i at line 17
Parallelization would require privatization of array a[i]
Conditional loop will be executed in scalar mode
14, Parallelization would require privatization of array a[i]
15, Accelerator restriction: induction variable live-out from loop: j
16, Accelerator restriction: induction variable live-out from loop: j
a[N]8
5
5
Here’s a second example where I remove the global index variables, thus allowing the outer loop to be parallelized. While not needed, I also switched the code to use the “update” directive so you can see it in context.
% cat test.cpp
#include<iostream>
#include<openacc.h>
#define N 5
using namespace std;
int main(){
int a[N];
#pragma acc data create(a[:N])
{
#pragma acc update device(a[:N])
#pragma acc kernels present(a[:N])
{
for(int i=0;i<N;i++){
for(int j=0;j<N;j++){
a[i]=i+j;
}
}
}
#pragma acc update self(a[:N])
}
cout<<"a[N]"<<a[N-1]<<endl;
}
% nvc++ -acc -Minfo=accel test.cpp ; a.out
main:
9, Generating create(a[:]) [if not already present]
12, Generating update device(a[:])
Generating present(a[:])
13, Loop is parallelizable
14, Loop carried reuse of a prevents parallelization
Inner sequential loop scheduled on accelerator
Generating NVIDIA GPU code
13, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
14, #pragma acc loop seq
14, Loop carried reuse of a prevents parallelization
20, Generating update self(a[:])
a[N]8