Hi Mat,
I did privatize the variables that I use in my inner loop, but it did not work.
I fixed it with the atomic directive, and the output is now correct.
#pragma acc parallel loop reduction(+:upott) present (r[0:1],(*r).x[0:myn],(*r).y[0:myn],(*r).z[0:myn])\
present (f[0:1],(*f).x[0:myn],(*f).y[0:myn],(*f).z[0:myn]) private (dx,dy,dz,radio) private (gr,grx,gry,grz)
for (unsigned int i=0; i <myn-1; i++)
for (unsigned int j=i+1; j <myn; j++) {
#pragma acc atomic update
(*f).x[i]+=grx;
#pragma acc atomic update
(*f).y[i]+=gry;
#pragma acc atomic update
(*f).z[i]+=grz;
#pragma acc atomic update
(*f).x[j]-=grx;
#pragma acc atomic update
(*f).y[j]-=gry;
#pragma acc atomic update
(*f).z[j]-=grz;
}
My issue now is that I gain only a little time on the GPU. For instance, the serial version takes 3 min to run and the acc version still takes 1 min.
The compiler reported a data dependency:
computeForces:
182, Generating present(r->z[:myn],r[:1],f[:1],r->x[:myn],r->y[:myn],f->z[:myn],f->x[:myn],f->y[:myn])
Accelerator kernel generated
Generating Tesla code
182, Generating reduction(+:upott)
184, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
185, #pragma acc loop seq
185, Loop carried dependence of f->x-> prevents parallelization
Loop carried backward dependence of f->x-> prevents vectorization
Loop carried dependence of f->z->,f->y-> prevents parallelization
Loop carried backward dependence of f->z->,f->y-> prevents vectorization
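One possible way around this dependence, as far as I understand it, is to stop updating particle j inside the pair loop: let every particle i visit all j != i and accumulate into private scalars, so each element of f is written by exactly one outer iteration and no atomics are needed. The following is only a minimal sketch under assumptions: the function name `compute_forces_full`, the flat pointer arguments, the `copyin`/`copy` clauses (used here instead of `present` so the sketch stands alone) and the `1/r^4` force law are all hypothetical placeholders, not the real code or potential.

```c
/* Sketch only: full neighbour loop, each i sums its own force into
   private scalars, so no two iterations write the same element of f. */
void compute_forces_full(int n,
                         const double *rx, const double *ry, const double *rz,
                         double *fx, double *fy, double *fz)
{
    #pragma acc parallel loop gang \
        copyin(rx[0:n], ry[0:n], rz[0:n]) copy(fx[0:n], fy[0:n], fz[0:n])
    for (int i = 0; i < n; i++) {
        double fxi = 0.0, fyi = 0.0, fzi = 0.0;   /* private to this i */

        #pragma acc loop vector reduction(+:fxi,fyi,fzi)
        for (int j = 0; j < n; j++) {
            if (j != i) {
                double dx = rx[i] - rx[j];
                double dy = ry[i] - ry[j];
                double dz = rz[i] - rz[j];
                double r2 = dx * dx + dy * dy + dz * dz;
                double gr = 1.0 / (r2 * r2);      /* placeholder force law */
                fxi += gr * dx;
                fyi += gr * dy;
                fzi += gr * dz;
            }
        }

        /* Single writer per element: no atomic needed here. */
        fx[i] += fxi;
        fy[i] += fyi;
        fz[i] += fzi;
    }
}
```

The trade-off is that every pair is evaluated twice, and a potential-energy sum such as upott would count each pair twice and have to be halved. Whether this is faster than the atomic version on a K20m is something I would have to measure.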
To eliminate the data dependency I added an auxiliary force array (SoA), faux (it holds the same data as f), but then the output again differs from the CPU.
void computeForces(coordinate *r,coordinate *f,coordinate *faux,double boxx, double boxy, double boxz,double rcut,double eps0,double *upot)
...
#pragma acc parallel loop reduction(+:upott) present (r[0:1],(*r).x[0:myn],(*r).y[0:myn],(*r).z[0:myn])\
present (f[0:1],(*f).x[0:myn],(*f).y[0:myn],(*f).z[0:myn])\
present (faux[0:1],(*faux).x[0:myn],(*faux).y[0:myn],(*faux).z[0:myn])\
private (dx,dy,dz,radio) private (gr,grx,gry,grz)
...
//-------------------------------------
(*faux).x[i]=(*f).x[i]+grx;
(*faux).y[i]=(*f).y[i]+gry;
(*faux).z[i]=(*f).z[i]+grz;
(*faux).x[j]=(*f).x[j]-grx;
(*faux).y[j]=(*f).y[j]-gry;
(*faux).z[j]=(*f).z[j]-grz;
//------------------------------------
...
Checking the output, the error is similar to the one I had without the atomic directive. So I added the atomic directive to these statements as well. In this case the compiler reported:
PGC-S-0155-Invalid atomic expression (coordinate.c: 213)
PGC-S-0155-Invalid atomic region. (coordinate.c: 214)
PGC-S-0155-Invalid atomic expression (coordinate.c: 215)
PGC-S-0155-Invalid atomic region. (coordinate.c: 216)
PGC-S-0155-Invalid atomic expression (coordinate.c: 217)
PGC-S-0155-Invalid atomic region. (coordinate.c: 218)
PGC-S-0155-Invalid atomic expression (coordinate.c: 219)
PGC-S-0155-Invalid atomic region. (coordinate.c: 220)
PGC-S-0155-Invalid atomic expression (coordinate.c: 221)
PGC-S-0155-Invalid atomic region. (coordinate.c: 222)
PGC-S-0155-Invalid atomic expression (coordinate.c: 223)
PGC-S-0155-Invalid atomic region. (coordinate.c: 240)
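My reading of the OpenACC specification is that `atomic update` only accepts statements that read and write the same location, such as `x += expr`, `x = x + expr` or `x++`. A statement like `(*faux).x[i]=(*f).x[i]+grx;` writes faux but reads f, which would explain the "Invalid atomic expression" messages. It also overwrites faux on every j, so only the last contribution survives. A minimal sketch of a form that should be accepted, assuming faux already holds a copy of f before the loop; the function name `add_pair_forces`, the 1-D arrays and the precomputed table `g` (standing in for grx) are hypothetical:

```c
/* Sketch only: accumulate pair contributions into faux with atomic updates.
   Assumes faux[0:n] was filled with the contents of f beforehand and that
   g[i*n + j] holds the (hypothetical, precomputed) pair term for i < j. */
void add_pair_forces(int n, const double *g, double *faux)
{
    #pragma acc parallel loop copyin(g[0:n*n]) copy(faux[0:n])
    for (int i = 0; i < n - 1; i++) {
        #pragma acc loop
        for (int j = i + 1; j < n; j++) {
            double grx = g[i * n + j];

            /* Valid: the same location is read and written. */
            #pragma acc atomic update
            faux[i] += grx;

            #pragma acc atomic update
            faux[j] -= grx;
        }
    }
}
```

This is essentially the first atomic version again, with faux as the target, so I would not expect it to remove the cost of the atomics themselves.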
Whole compiler output:
$ make
pgcc -acc -ta=tesla:cc35 -Minfo=accel -Msafeptr -c99 -c coordinate.c
updateArray:
53, Generating present(u[:1],u->z[:myn],d[:1],u->x[:myn],u->y[:myn],d->z[:myn],d->x[:myn],d->y[:myn])
Accelerator kernel generated
Generating Tesla code
55, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
zeroArray:
67, Generating present(c->z[:myn],c[:1],c->x[:myn],c->y[:myn])
Accelerator kernel generated
Generating Tesla code
68, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
computeUkin:
83, Generating present(v->z[:myn],v[:1],v->x[:myn],v->y[:myn])
Accelerator kernel generated
Generating Tesla code
83, Generating reduction(+:ukin)
84, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
updateVscale:
95, Generating present(v->z[:myn],v[:1],v->x[:myn],v->y[:myn])
Accelerator kernel generated
Generating Tesla code
96, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
updateArrayBC:
117, Generating present(u->z[:myn],u[:1],u->x[:myn],u->y[:myn])
Accelerator kernel generated
Generating Tesla code
118, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
initArray:
167, Generating enter data copyin(s[:1])
168, Generating enter data copyin(s->z[:myn],s->y[:myn],s->x[:myn])
PGC-S-0155-Invalid atomic expression (coordinate.c: 213)
PGC-S-0155-Invalid atomic region. (coordinate.c: 214)
PGC-S-0155-Invalid atomic expression (coordinate.c: 215)
PGC-S-0155-Invalid atomic region. (coordinate.c: 216)
PGC-S-0155-Invalid atomic expression (coordinate.c: 217)
PGC-S-0155-Invalid atomic region. (coordinate.c: 218)
PGC-S-0155-Invalid atomic expression (coordinate.c: 219)
PGC-S-0155-Invalid atomic region. (coordinate.c: 220)
PGC-S-0155-Invalid atomic expression (coordinate.c: 221)
PGC-S-0155-Invalid atomic region. (coordinate.c: 222)
PGC-S-0155-Invalid atomic expression (coordinate.c: 223)
PGC-S-0155-Invalid atomic region. (coordinate.c: 240)
computeForces:
183, Generating present(r[:1],r->z[:myn],faux[:1],r->x[:myn],r->y[:myn],f->z[:myn],f->x[:myn],f->y[:myn],faux->z[:myn],f[:1],faux->x[:myn],faux->y[:myn])
Accelerator kernel generated
Generating Tesla code
187, #pragma acc loop gang /* blockIdx.x */
190, #pragma acc loop vector(128) /* threadIdx.x */
190, Loop is parallelizable
initforcesll:
256, Generating enter data copyin(c[:1])
257, Generating enter data copyin(c->z[:myn],c->y[:myn],c->x[:myn])
mirrorForces:
273, Generating present(f->z[:myn],f->x[:myn],f->y[:myn],faux->z[:myn],faux[:1],f[:1],faux->x[:myn],faux->y[:myn])
Accelerator kernel generated
Generating Tesla code
275, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
PGC/x86-64 Linux 17.10-0: compilation completed with severe errors
make: *** [coordinate.o] Error 2
How can I eliminate the data dependency, and how can I apply the atomic directive in this case?
How can I make the code run faster?
- I read about UVA and other acc features, but my architecture is a Tesla K20m, and I read that it is not supported there.
Thanks in advance for your help.
Luz