Accelerator clause: upper bound for dimension 0 of array

Hi there,

I created an SoA (structure of arrays) for my points.

struct Coordinates
{
 double *x;   // x-coordinates
 double *y;   // y-coordinates
 double *z;   // z-coordinates
 int n;       // number of points
};

In the sequential version, I passed the struct by reference in order to initialize the data.

To move the data to the device I used the copyin directive, but I am not sure how to handle the dereference operator. Did I do it correctly? I ask because I get this message:

Accelerator clause: upper bound for dimension 0 of array ‘s’ is unknown

rcoordinate:
154, Accelerator clause: upper bound for dimension 0 of array ‘s’ is unknown
Generating enter data copyin(s[:1])
155, Generating enter data copyin(s->z[:myn],s->y[:myn],s->x[:myn])

I think the arrays x, y, and z were created on the device.



void rcoordinate(coordinate *s,char fname[10])
139 {
140  int myn;
141  FILE *rfile;
142      myn=(*s).n;
143      
144      rfile=openfile(fname,"r"); 
145          for (unsigned int i=0; i < myn; i++) {
146              double tmpx,tmpy,tmpz; 
147              fscanf(rfile,"%lf %lf %lf\n",&tmpx,&tmpy,&tmpz);
148              (*s).x[i]=tmpx;
149              (*s).y[i]=tmpy;
150              (*s).z[i]=tmpz;
151             }
152      closefile(rfile);
153 
154 #pragma acc enter data copyin(s)
155 #pragma acc enter data copyin((*s).x[0:myn],(*s).y[0:myn],(*s).z[0:myn])
156 
157 }

Thanks in advance for your help.

Hi azzulrd,

I think you should be OK, but I would add “s[0:1]” to the copyin clause. Technically, just putting “s” in the clause means that you want to copy in the pointer, not what it points to. While the PGI compiler is interpreting what you meant, it’s probably better to be explicit.
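For example, the explicit in-function form might look something like the sketch below. (The `device_copyin` helper name is mine, and the `coordinate` typedef is inferred from your struct; this is a sketch, not your exact code. Without -acc the pragmas are simply ignored, so it is host-compilable.)

```c
#include <stdlib.h>

/* Typedef inferred from the posted struct Coordinates. */
typedef struct { double *x, *y, *z; int n; } coordinate;

void device_copyin(coordinate *s)
{
    /* s[0:1] copies the one struct object itself; the second
       directive then copies the three arrays it points to, and
       the runtime attaches the device-side pointers automatically. */
    #pragma acc enter data copyin(s[0:1])
    #pragma acc enter data copyin(s->x[0:s->n], s->y[0:s->n], s->z[0:s->n])
}
```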

Also, you may want to consider moving these directives to just after the call to rcoordinate rather than inside it, especially if it’s only called once. Then you don’t need to worry about the extra level of indirection.

Something like:

rcoordinate(&s, "file.dat");
#pragma acc enter data copyin(s) 
#pragma acc enter data copyin(s.x[0:s.n], s.y[0:s.n], s.z[0:s.n])

Or possibly, couple the allocation of the arrays to the creation of the device, and then use updates to synchronize the data. I personally prefer this since I can then match my allocation/free with the enter/exit data regions. Unless you need to manage the amount of memory being used on the GPU at any one time, I find this method cleaner.

s.n = <somenumber>;
s.x = (double *) malloc (sizeof(double)*s.n);
s.y = (double *) malloc (sizeof(double)*s.n);
s.z = (double *) malloc (sizeof(double)*s.n);
#pragma acc enter data copyin(s) 
#pragma acc enter data create(s.x[0:s.n], s.y[0:s.n], s.z[0:s.n])
... 
rcoordinate(&s, "file.dat");
#pragma acc update device(s.x[0:s.n], s.y[0:s.n], s.z[0:s.n])
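At shutdown, the matching teardown might look something like this (a sketch, mirroring the enter data directives above in reverse; the `device_teardown` helper name and the `coordinate` typedef are mine, inferred from the post):

```c
#include <stdlib.h>

/* Typedef inferred from the posted struct Coordinates. */
typedef struct { double *x, *y, *z; int n; } coordinate;

void device_teardown(coordinate *s)
{
    /* Mirror the enter data directives in reverse: delete the
       pointed-to arrays first, then the struct itself, and only
       then free the host memory -- so every malloc/free pairs
       with an enter/exit data. */
    #pragma acc exit data delete(s->x[0:s->n], s->y[0:s->n], s->z[0:s->n])
    #pragma acc exit data delete(s[0:1])
    free(s->x);
    free(s->y);
    free(s->z);
}
```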

-Mat

Hi Mat,

Well, you are right, I do use GPU memory throughout the run, and I have functions that I call more than once. I followed
your advice, tested both options, and both work well.
Now I can access GPU memory and that part is going well. However, when I call one function, it gives a different result than the CPU version.

I read about the private clause, but I think I do not need to use it, since the variables are already in the threads.

(I also read this blog)

https://www.pgroup.com/userforum/viewtopic.php?p=18596&sid=53619abc18604ae160a2e3fb954c0ab1

So, I have some questions:

Do I have to use private clauses?
What happens with a variable when I pass it by reference? Here I used a temporary variable, and after the loops I just reassigned it.
What happens when I call the fixdistance function? Is it the same as in the serial version?
Serial Code

void ccforces(coordinate *r,coordinate *f, double boxx, double boxy, double boxz,double rcut,double eps0,double *upot)
{
  int myn;
      myn=(*r).n;

        for (unsigned int i=0; i <myn-1; i++)
          for (unsigned int j=i+1; j <myn; j++) {
              double dx,dy,dz,sr;
              dx=(*r).x[i]-(*r).x[j];
              dy=(*r).y[i]-(*r).y[j];
              dz=(*r).z[i]-(*r).z[j];

              fixdistance(&dx,boxx);   //fix distance to minimal distance x
              fixdistance(&dy,boxy);   //fix distance to minimal distance y 
              fixdistance(&dz,boxz);   //fix distance to minimal distance z

              sr=ccr(dx,dy,dz);        // r 
              if (sr < rcut)
               {
                double gr,grx,gry,grz;
                 gr=ccgradient(sr,eps0);  // gradient 
                 *upot+=ccupot(sr,eps0);  //upot
                 grx=gr*dx;
                 gry=gr*dy;
                 grz=gr*dz;

                 (*f).x[i]+=grx;
                 (*f).y[i]+=gry;
                 (*f).z[i]+=grz;

                 (*f).x[j]-=grx;
                 (*f).y[j]-=gry;
                 (*f).z[j]-=grz;
               }
            }

}

OpenACC Code

void ccforces(coordinate *r,coordinate *f, double boxx, double boxy, double boxz,double rcut,double eps0,double *upot)
{
  int myn;
  double upott;
  double dx,dy,dz,sr;
  double gr,grx,gry,grz;

   upott=*upot;
   myn=(*r).n;

        #pragma acc parallel loop reduction(+:upott) present (r[0:1],(*r).x[0:myn],(*r).y[0:myn],(*r).z[0:myn])\
                                  present (f[0:1],(*f).x[0:myn],(*f).y[0:myn],(*f).z[0:myn])
        for (unsigned int i=0; i <myn-1; i++)
          for (unsigned int j=i+1; j <myn; j++) {
              dx=(*r).x[i]-(*r).x[j];
              dy=(*r).y[i]-(*r).y[j];
              dz=(*r).z[i]-(*r).z[j];

              fixdistance(&dx,boxx);   
              fixdistance(&dy,boxy);   
              fixdistance(&dz,boxz);   

              sr=ccr(dx,dy,dz);        // r
              if (sr < rcut)
               {

                 gr=ccgradient(sr,eps0);  // gradient
                 //*upot+=ccupot(sr,eps0);  //upot
                 upott+=ccupot(sr,eps0);  //upot
                 grx=gr*dx;
                 gry=gr*dy;
                 grz=gr*dz;

                 (*f).x[i]+=grx;
                 (*f).y[i]+=gry;
                 (*f).z[i]+=grz;

                 (*f).x[j]-=grx;
                 (*f).y[j]-=gry;
                 (*f).z[j]-=grz;
               }
            }
*upot+=upott;
}



#pragma acc routine seq
void fixdistance(double *d, double dbox)
{
   if (*d >= 0.5*dbox)
      *d-=2.0;
    else
      if (*d< -0.5*dbox)
      *d+=1.0;
}

I really appreciate your help.

Hi azzulrd,

Most likely, since you’re passing “dx”, “dy”, and “dz” by reference, the compiler is not implicitly privatizing these scalars. Since the addresses of these scalars “escape”, the compiler can’t tell if the addresses are taken by other variables.

What’s the output from the compiler feedback? (i.e. -Minfo=accel). This will show what’s happening.

To fix, try manually privatizing them:

       #pragma acc parallel loop reduction(+:upott) present (r[0:1],(*r).x[0:myn],(*r).y[0:myn],(*r).z[0:myn])\ 
                                  present (f[0:1],(*f).x[0:myn],(*f).y[0:myn],(*f).z[0:myn]) private(dx,dy,dz)
        for (unsigned int i=0; i <myn-1; i++) 
          for (unsigned int j=i+1; j <myn; j++) {
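Another option is to avoid taking the scalars’ addresses altogether, e.g. with a value-returning variant of fixdistance. A sketch (the `fixdistance_val` name is mine, and the wrap by dbox assumes the usual minimum-image convention; adjust it if your posted 2.0/1.0 constants are intentional):

```c
#pragma acc routine seq
static double fixdistance_val(double d, double dbox)
{
    /* Value-returning form: no address escapes, so the compiler can
       keep d in a register and privatize it implicitly. */
    if (d >= 0.5 * dbox)
        d -= dbox;
    else if (d < -0.5 * dbox)
        d += dbox;
    return d;
}
```

The call sites then become plain assignments, e.g. `dx = fixdistance_val(dx, boxx);`, and dx, dy, dz can stay loop-local.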

If that doesn’t help, please post a reproducing example so I can take a look.

-Mat

Hi Mat,

I did; I privatized the variables that I use in my inner loop. However, it did not work.
I fixed it with the atomic directive, and the output is correct.

 #pragma acc  parallel loop reduction(+:upott) present (r[0:1],(*r).x[0:myn],(*r).y[0:myn],(*r).z[0:myn])\
                                  present (f[0:1],(*f).x[0:myn],(*f).y[0:myn],(*f).z[0:myn]) private (dx,dy,dz,radio) private (gr,grx,gry,grz)
        for (unsigned int i=0; i <myn-1; i++)
          for (unsigned int j=i+1; j <myn; j++) {



                #pragma acc atomic update
                 (*f).x[i]+=grx;
                #pragma acc atomic update
                 (*f).y[i]+=gry;
                #pragma acc atomic update
                 (*f).z[i]+=grz;
                #pragma acc atomic update
                 (*f).x[j]-=grx;
                #pragma acc atomic update
                 (*f).y[j]-=gry;
                #pragma acc atomic update
                 (*f).z[j]-=grz;
               }

My issue now is that I only gain a modest amount of time on the GPU. For instance, the serial version takes 3 min to run and the acc version takes 1 min.

The compiler reported a data dependency:

computeForces:
182, Generating present(r->z[:myn],r[:1],f[:1],r->x[:myn],r->y[:myn],f->z[:myn],f->x[:myn],f->y[:myn])
Accelerator kernel generated
Generating Tesla code
182, Generating reduction(+:upott)
184, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
185, #pragma acc loop seq
185, Loop carried dependence of f->x-> prevents parallelization
Loop carried backward dependence of f->x-> prevents vectorization
Loop carried dependence of f->z->,f->y-> prevents parallelization
Loop carried backward dependence of f->z->,f->y-> prevents vectorization

In order to eliminate the data dependency I added an auxiliary force array (SoA) faux (it holds the same data as f), but then the output is again different from the CPU.

void computeForces(coordinate *r,coordinate *f,coordinate *faux,double boxx, double boxy, double boxz,double rcut,double eps0,double *upot)
...
#pragma acc  parallel loop reduction(+:upott) present (r[0:1],(*r).x[0:myn],(*r).y[0:myn],(*r).z[0:myn])\
                                  present (f[0:1],(*f).x[0:myn],(*f).y[0:myn],(*f).z[0:myn])\
                                  present (faux[0:1],(*faux).x[0:myn],(*faux).y[0:myn],(*faux).z[0:myn])\
                                  private (dx,dy,dz,radio) private (gr,grx,gry,grz) 

...
       //------------------------------------- 
                 (*faux).x[i]=(*f).x[i]+grx;       
                 (*faux).y[i]=(*f).y[i]+gry;         
                 (*faux).z[i]=(*f).z[i]+grz;
                 (*faux).x[j]=(*f).x[j]-grx;
                 (*faux).y[j]=(*f).y[j]-gry;
                 (*faux).z[j]=(*f).z[j]-grz;
       //------------------------------------

...

Checking the output, the error is similar to the one I had without the atomic directive, so I added atomics here as well. In this case the compiler reported:


PGC-S-0155-Invalid atomic expression (coordinate.c: 213)
PGC-S-0155-Invalid atomic region. (coordinate.c: 214)
PGC-S-0155-Invalid atomic expression (coordinate.c: 215)
PGC-S-0155-Invalid atomic region. (coordinate.c: 216)
PGC-S-0155-Invalid atomic expression (coordinate.c: 217)
PGC-S-0155-Invalid atomic region. (coordinate.c: 218)
PGC-S-0155-Invalid atomic expression (coordinate.c: 219)
PGC-S-0155-Invalid atomic region. (coordinate.c: 220)
PGC-S-0155-Invalid atomic expression (coordinate.c: 221)
PGC-S-0155-Invalid atomic region. (coordinate.c: 222)
PGC-S-0155-Invalid atomic expression (coordinate.c: 223)
PGC-S-0155-Invalid atomic region. (coordinate.c: 240)

whole compiler output:
$ make
pgcc -acc -ta=tesla:cc35 -Minfo=accel -Msafeptr -c99 -c coordinate.c
updateArray:
53, Generating present(u[:1],u->z[:myn],d[:1],u->x[:myn],u->y[:myn],d->z[:myn],d->x[:myn],d->y[:myn])
Accelerator kernel generated
Generating Tesla code
55, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
zeroArray:
67, Generating present(c->z[:myn],c[:1],c->x[:myn],c->y[:myn])
Accelerator kernel generated
Generating Tesla code
68, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
computeUkin:
83, Generating present(v->z[:myn],v[:1],v->x[:myn],v->y[:myn])
Accelerator kernel generated
Generating Tesla code
83, Generating reduction(+:ukin)
84, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
updateVscale:
95, Generating present(v->z[:myn],v[:1],v->x[:myn],v->y[:myn])
Accelerator kernel generated
Generating Tesla code
96, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
updateArrayBC:
117, Generating present(u->z[:myn],u[:1],u->x[:myn],u->y[:myn])
Accelerator kernel generated
Generating Tesla code
118, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
initArray:
167, Generating enter data copyin(s[:1])
168, Generating enter data copyin(s->z[:myn],s->y[:myn],s->x[:myn])
PGC-S-0155-Invalid atomic expression (coordinate.c: 213)
PGC-S-0155-Invalid atomic region. (coordinate.c: 214)
PGC-S-0155-Invalid atomic expression (coordinate.c: 215)
PGC-S-0155-Invalid atomic region. (coordinate.c: 216)
PGC-S-0155-Invalid atomic expression (coordinate.c: 217)
PGC-S-0155-Invalid atomic region. (coordinate.c: 218)
PGC-S-0155-Invalid atomic expression (coordinate.c: 219)
PGC-S-0155-Invalid atomic region. (coordinate.c: 220)
PGC-S-0155-Invalid atomic expression (coordinate.c: 221)
PGC-S-0155-Invalid atomic region. (coordinate.c: 222)
PGC-S-0155-Invalid atomic expression (coordinate.c: 223)
PGC-S-0155-Invalid atomic region. (coordinate.c: 240)
computeForces:
183, Generating present(r[:1],r->z[:myn],faux[:1],r->x[:myn],r->y[:myn],f->z[:myn],f->x[:myn],f->y[:myn],faux->z[:myn],f[:1],faux->x[:myn],faux->y[:myn])
Accelerator kernel generated
Generating Tesla code
187, #pragma acc loop gang /* blockIdx.x */
190, #pragma acc loop vector(128) /* threadIdx.x */
190, Loop is parallelizable
initforcesll:
256, Generating enter data copyin(c[:1])
257, Generating enter data copyin(c->z[:myn],c->y[:myn],c->x[:myn])
mirrorForces:
273, Generating present(f->z[:myn],f->x[:myn],f->y[:myn],faux->z[:myn],faux[:1],f[:1],faux->x[:myn],faux->y[:myn])
Accelerator kernel generated
Generating Tesla code
275, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
PGC/x86-64 Linux 17.10-0: compilation completed with severe errors
make: *** [coordinate.o] Error 2


How could I eliminate the data dependency? How could I apply the atomic directive in this case?

How can I make the code run faster?

  • I read about UVA and other acc features, but my architecture is a Tesla K20m, and I read that it is not supported.


Thanks in advance for your help.

Luz

Hi Luz,

I fixed with atomic directive

Aah, I missed the dependency, so yes, the atomic is needed.

My issue now, is that I just gain a little time vs GPU. For Instance in serial version it takes 3 min to run and in acc version just take 1 min.

Atomics can be costly, especially on older devices such as a K20. They have gotten faster on newer devices such as P100 and V100s, but still should be avoided if possible.

In order to eliminate data dependency I added an aux force array (SoA) fuax (it has the same data that f) then output are again different vs CPU.

Won’t this have the same dependency? The dependency is that you have overlapping indices between the i and j loops, so accessing:

                 (*faux).x[i]=(*f).x[i]+grx;   
                  ...
                 (*faux).x[j]=(*f).x[j]-grx;

Is really the same dependency as:

                 (*f).x[i]+=grx; 
                 ....
                 (*f).x[j]-=grx;

You might be able to get rid of 3 of the atomics if you use two arrays, one for “i” and one for “j”. But I don’t think that will help much.

What I’m thinking is to break up the problem so the gr values are captured to temporary 2D arrays. This way you remove the dependency between the i and j loops. I’d also change the j loop to be rectangular with an if condition. This way both loops are fully parallelizable.

Then have two more loop nests, one parallelized across i and a second parallelized across j.

Something like what I have below. Big caveat! Not having the full code, I can’t test that my logic is actually correct. So while I think it might work, I’m not positive.

% cat test.c
#include <stdlib.h>

struct Coordinates
{
 double *x;   // x-coordinates
 double *y;   //y-coordinates
 double *z;   // z-coordinates
 int n;
};
typedef struct Coordinates coordinate;

#pragma acc routine seq
double ccr(double,double,double);
#pragma acc routine seq
double ccgradient(double,double);
#pragma acc routine seq
double ccupot(double,double);
#pragma acc routine seq
void fixdistance(double *d, double dbox);

void ccforces(coordinate *r,coordinate *f, double boxx, double boxy, double boxz,double rcut,double eps0,double *upot)
{
  int myn;
  double upott;
  double dx,dy,dz,sr;
  double gr,grx,gry,grz;
  double **x,**y,**z;

   upott=*upot;
   myn=(*r).n;
   x=(double**) malloc(sizeof(double*)*myn);
   y=(double**) malloc(sizeof(double*)*myn);
   z=(double**) malloc(sizeof(double*)*myn);
  for (unsigned int i=0; i <myn; i++) {
      x[i] = (double*) malloc(sizeof(double*)*myn);
      y[i] = (double*) malloc(sizeof(double*)*myn);
      z[i] = (double*) malloc(sizeof(double*)*myn);
   }
   #pragma acc data create(x[0:myn][0:myn],y[0:myn][0:myn],z[0:myn][0:myn]) \
        present (r,(*r).x[0:myn],(*r).y[0:myn],(*r).z[0:myn]) \
        present (f,(*f).x[0:myn],(*f).y[0:myn],(*f).z[0:myn])
   {
        #pragma acc parallel loop reduction(+:upott)
        for (unsigned int i=0; i <myn-1; i++) {
          #pragma acc loop
          for (unsigned int j=0; j <myn; j++) {
            if (j > i) {
              dx=(*r).x[i]-(*r).x[j];
              dy=(*r).y[i]-(*r).y[j];
              dz=(*r).z[i]-(*r).z[j];

              fixdistance(&dx,boxx);
              fixdistance(&dy,boxy);
              fixdistance(&dz,boxz);

              sr=ccr(dx,dy,dz);        // r
              if (sr < rcut)
               {

                 gr=ccgradient(sr,eps0);  // gradient
                 //*upot+=ccupot(sr,eps0);  //upot
                 upott+=ccupot(sr,eps0);  //upot
                 grx=gr*dx;
                 gry=gr*dy;
                 grz=gr*dz;
                 x[i][j]=grx;
                 y[i][j]=gry;
                 z[i][j]=grz;
               } else {  // sr >= rcut: still zero the slot so the sums below don't read garbage
                 x[i][j]=0.0;
                 y[i][j]=0.0;
                 z[i][j]=0.0;
               }
            } else {
                 x[i][j]=0.0;
                 y[i][j]=0.0;
                 z[i][j]=0.0;
            }

            }}
        *upot+=upott;
        #pragma acc parallel loop
        for (unsigned int i=0; i <myn-1; i++)
          for (unsigned int j=0; j <myn; j++) {
                 (*f).x[i]+=x[i][j];
                 (*f).y[i]+=y[i][j];
                 (*f).z[i]+=z[i][j];
          }

        #pragma acc parallel loop
        for (unsigned int j=0; j <myn; j++)
           for (unsigned int i=0; i <myn-1; i++)  {
                 (*f).x[j]-=x[i][j];
                 (*f).y[j]-=y[i][j];
                 (*f).z[j]-=z[i][j];
          }
     }

     for (unsigned int i=0; i <myn; i++) {
        free(x[i]);
        free(y[i]);
        free(z[i]);
     }
     free(x);
     free(y);
     free(z);
}
% pgcc -c test.c -Minfo=accel -ta=tesla:cc70
ccforces:
     39, Generating present(r[:])
         Generating create(x[:myn][:myn],y[:myn][:myn])
         Generating present(r->z[:myn],f[:],r->x[:myn],r->y[:myn],f->z[:myn],f->x[:myn],f->y[:myn])
         Generating create(z[:myn][:myn])
     43, Accelerator kernel generated
         Generating Tesla code
         44, #pragma acc loop gang /* blockIdx.x */
             Generating reduction(+:upott)
         46, #pragma acc loop vector(128) /* threadIdx.x */
     43, Generating implicit copy(upott)
     46, Loop is parallelizable
     78, Accelerator kernel generated
         Generating Tesla code
         79, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         80, #pragma acc loop seq
     80, Complex loop carried dependence of x->->,f->x-> prevents parallelization
         Loop carried dependence of f->x-> prevents parallelization
         Loop carried backward dependence of f->x-> prevents vectorization
         Complex loop carried dependence of y->->,f->y->,z->->,f->z-> prevents parallelization
     86, Accelerator kernel generated
         Generating Tesla code
         87, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         88, #pragma acc loop seq
     88, Complex loop carried dependence of f->x->,x->-> prevents parallelization
         Loop carried dependence of f->x-> prevents parallelization
         Loop carried backward dependence of f->x-> prevents vectorization
         Complex loop carried dependence of f->y->,y->->,f->z->,z->-> prevents parallelization

Hi Mat,

I reproduced what you explained to me. There is no problem with the compiler.

(computeForces is my ccforces)
Compiler output:

computeForces:
201, Generating present(f[:1])
Generating create(x[:myn][:myn],y[:myn][:myn])
Generating present(f->z[:myn],f->x[:myn],f->y[:myn])
Generating create(z[:myn][:myn])
204, Accelerator kernel generated
Generating Tesla code
204, Generating reduction(+:upott)
206, #pragma acc loop gang /* blockIdx.x */
208, #pragma acc loop vector(128) /* threadIdx.x */
204, Generating implicit copyin(r[:1])
208, Loop is parallelizable
238, Accelerator kernel generated
Generating Tesla code
239, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
240, #pragma acc loop seq
240, Complex loop carried dependence of f->x-> prevents parallelization
Loop carried dependence of f->x-> prevents parallelization
Loop carried backward dependence of f->x-> prevents vectorization
246, Accelerator kernel generated
Generating Tesla code
247, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
248, #pragma acc loop seq
248, Loop carried dependence of f->x-> prevents parallelization
Loop carried backward dependence of f->x-> prevents vectorization
Complex loop carried dependence of f->x->,f->z->,f->y-> prevents parallelization

But when I run it, I get:

x lives at 0x157f6b0 size 40 partially present
x lives at 0x157f4b0 size 40 present
x lives at 0x157f540 size 40 present
x lives at 0x157f5d0 size 40 present
x lives at 0x157f660 size 40 present
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.5, threadid=1

I thought it was because of the size myn (maybe there is not enough memory). First I had set myn=1000000, so I reduced the size to 1000, and then to 5, but the output is the same.

I ran cuda-memcheck. At the end of its output it says:

========= ERROR SUMMARY: 0 errors
[luz]$


void computeForces(coordinate *r,coordinate *f,double boxx, double boxy, double boxz,double rcut,double eps0,double *upot)
{
  int myn;
  double upott;
  double dx,dy,dz,radio,gr,grx,gry,grz;
  double ** restrict x,** restrict y,** restrict z;

  upott=*upot;
  myn=(*r).n;
 //double * restrict x;
       x=(double**) malloc(sizeof(double*)*myn);
         if (x == NULL) {
             printf("No memory Available did not create x array for Forces\n");
             exit (0);
            }
       y=(double**) malloc(sizeof(double*)*myn);
         if (y == NULL) {
             printf("No memory Available did not create y array for Forces\n");
             exit (0);
           }
       z=(double**) malloc(sizeof(double*)*myn);
          if (z == NULL) {
             printf("No memory Available did not create z array for Forces\n");
             exit (0);
           }

       for (unsigned int i=0; i <myn; i++) {
           x[i] = (double*) malloc(sizeof(double*)*myn);
           y[i] = (double*) malloc(sizeof(double*)*myn);
           z[i] = (double*) malloc(sizeof(double*)*myn);
           }

      #pragma acc data create(x[0:myn][0:myn],y[0:myn][0:myn],z[0:myn][0:myn])\
                  present (f[0:1],(*f).x[0:myn],(*f).y[0:myn],(*f).z[0:myn])
     {
      #pragma acc parallel loop reduction(+:upott) private (dx,dy,dz,radio) private (gr,grx,gry,grz)
                  //private (dx,dy,dz,radio) private (gr,grx,gry,grz)
        for (unsigned int i=0; i <myn-1; i++) {
          #pragma acc loop 
          for (unsigned int j=0; j <myn; j++) {
        if (i>j) {
              dx=(*r).x[i]-(*r).x[j];
              dy=(*r).y[i]-(*r).y[j];
              dz=(*r).z[i]-(*r).z[j];

              fixdistance(&dx,boxx);   //fix distance to minimal distance x
              fixdistance(&dy,boxy);   //fix distance to minimal distance y
              fixdistance(&dz,boxz);   //fix distance to minimal distance z

              radio=computeRadio(dx,dy,dz);        // r
              if (radio < rcut) {
                 gr=computeGradient(radio,eps0);  // gradient
                 upott+=computeUpot(radio,eps0);  //upot
                 grx=gr*dx;
                 gry=gr*dy;
                 grz=gr*dz;
                 x[i][j]=grx;
                 y[i][j]=gry;
                 z[i][j]=grz;
               } // r<rcut
            } else { //i>j
                    x[i][j]=0.0;
                    y[i][j]=0.0;
                    z[i][j]=0.0;
                   } //else
        }} //forj, fori

*upot+=upott;

       #pragma acc parallel loop
       for (unsigned int i=0; i <myn-1; i++)
         for (unsigned int j=0; j <myn; j++) {
               (*f).x[i]+=x[i][j];
               (*f).y[i]+=y[i][j];
               (*f).z[i]+=z[i][j];
         }

       #pragma acc parallel loop
       for (unsigned int j=0; j <myn; j++)
         for (unsigned int i=0; i <myn-1; i++)  {
             (*f).x[j]-=x[i][j];
             (*f).y[j]-=y[i][j];
             (*f).z[j]-=z[i][j];
         }

      for (unsigned int i=0; i <myn; i++) {
           free(x[i]);
           free(y[i]);
           free(z[i]);
       }

       free(x);
       free(y);
       free(z);
} //pragma datacreate
}

I think we can allocate n=5 without a problem.
Should we allocate a memory size corresponding to the shared memory per SM?
What does it depend on?
Should I try a sliding window?

Thanks in advance for your help.

Luz

Sorry about that, but since I don’t have your full application, I can’t test the code.

One issue I see is a typo in my code.

The erroneous lines are:

  for (unsigned int i=0; i <myn; i++) { 
      x[i] = (double*) malloc(sizeof(double*)*myn); 
      y[i] = (double*) malloc(sizeof(double*)*myn); 
      z[i] = (double*) malloc(sizeof(double*)*myn); 
   }

But instead of the size of a “double*”, it should be the size of a “double”.

  for (unsigned int i=0; i <myn; i++) { 
      x[i] = (double*) malloc(sizeof(double)*myn); 
      y[i] = (double*) malloc(sizeof(double)*myn); 
      z[i] = (double*) malloc(sizeof(double)*myn); 
   }

Though both a “double” and a “double*” are 8 bytes, so it might not matter.


Another issue I see in your version is that you free the memory before the end of the data region. You should close the data region before freeing the memory it uses.
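
A minimal sketch of the correct nesting (the `sum_squares` example is hypothetical, not from the application; without -acc the pragmas are simply ignored, so it is host-compilable):

```c
#include <stdlib.h>

/* The structured data region must end before the host memory it
   maps is freed; freeing inside the region leaves stale entries
   in the runtime's present table. */
double sum_squares(int n)
{
    double *a = (double *) malloc(sizeof(double) * n);
    double total = 0.0;
    #pragma acc data create(a[0:n])
    {
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; i++)
            a[i] = (double) i * (double) i;
        #pragma acc parallel loop present(a[0:n]) reduction(+:total)
        for (int i = 0; i < n; i++)
            total += a[i];
    }            /* the data region closes here ...            */
    free(a);     /* ... and only then is the host memory freed */
    return total;
}
```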

During the run of the program, the runtime keeps track of what data is on the device, its corresponding host address, and its size in a “present” table. A “partially present” error means that there’s some overlap with another variable, or the same host address is used again with a different size. Why it’s happening here, I’m not quite sure.

If fixing the two small issues listed above doesn’t fix the partially present error, can you please post the full output of the present table dump? That might give clues as to which variable “x” is overlapping.

-Mat