Strange behavior with enter data copyin

Hi,

I am confused about the effect of an “acc enter data copyin” statement in combination with a kernels or parallel statement.
This simplistic dummy code:

 int n = 5;
 int* a = new int[n];
#pragma acc enter data pcopyin(n)
#pragma acc parallel loop pcopyin(n,a[0:n])
   for(int i=0;i<n;++i){
     a[i] = 0;
   }

produces the following compiler output (PGI 15.7-15.10)

    145, Generating enter data copyin(n)
         Loop without integer trip count will be executed in sequential mode
         Generating copyin(n,a[:n])
         Accelerator kernel generated
         Generating Tesla code

However, if I comment out the “enter data copyin” statement:

 int n = 5;
 int* a = new int[n];
//#pragma acc enter data pcopyin(n)
#pragma acc parallel loop pcopyin(n,a[0:n])
   for(int i=0;i<n;++i){
     a[i] = 0;
   }

the compiler gives a more pleasing output in terms of parallelization:

   145, Generating copyin(n,a[:n])
         Accelerator kernel generated
         Generating Tesla code
        148, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

My understanding is that the two cases should differ only in when the scalar n is copied to the GPU, and that this should not affect loop parallelization. In the first case, n should be copied to the GPU at the enter data statement, so it is already present when the loop is reached. In the second case, n should be copied to the GPU at the parallel statement.

Doing the same exercise with the kernels statement results in the following:

 int n = 5;
 int* a = new int[n];
#pragma acc enter data pcopyin(n)
#pragma acc kernels pcopyin(n,a[0:n])
 {
#pragma acc loop independent
   for(int i=0;i<n;++i){
     a[i] = 0;
   }
 }



    148, Generating enter data copyin(n)
         Generating copyin(n,a[:n])
    150, Conditional loop will be executed in scalar mode
         Accelerator scalar kernel generated

And again, commenting out the enter data copyin statement

 int n = 5;
 int* a = new int[n];
 //#pragma acc enter data pcopyin(n)
#pragma acc kernels pcopyin(n,a[0:n])
 {
#pragma acc loop independent
   for(int i=0;i<n;++i){
     a[i] = 0;
   }
 }

leads to parallelization

    148, Generating copyin(n,a[:n])
    150, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        150, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

If I do an enter data copyin of the array, I don’t see it affecting the parallelization. So it seems that explicit data handling of loop bound variables is not supported. Leaving the data handling of n to the compiler, i.e. using no explicit pcopyin clause for it, means its value is always copied in from the host, which in most cases is OK. But it might affect performance if you have many loop nests with the same bounds, since those few bytes have to be transferred to the GPU before each kernel starts.

Thanks,
LS

Hi LS,

What’s happening is that the address of “n” is being passed to a runtime routine to create the data on the device. Hence the compiler must assume that another reference to “n” is created and that the value of “n” could change during the execution of the loop. This inhibits parallelization of the loop.

It does not occur in the second case since the dependency analysis of the loop occurs before the device data create routine is called.
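
The aliasing argument above can be sketched in plain C++ without any OpenACC. This is only an illustration under stated assumptions: `register_with_runtime`, `fill_escaped` and `fill_snapshot` are hypothetical names that stand in for the runtime data-create call and the loop, and whether copying the bound into a local actually restores parallelization with PGI 15.x has not been verified here.

```cpp
// Hypothetical stand-in for the runtime routine that receives the address
// of n. In a real build this routine lives in the runtime library and is
// opaque to the compiler; here it only records the pointer.
static const void* g_registered_ptr = nullptr;

void register_with_runtime(const void* p) { g_registered_ptr = p; }

// The address of n escapes before the loop, so a conservative compiler
// must assume that n could be modified through that pointer while the
// loop runs. The trip count is then not known to be loop-invariant.
int fill_escaped(int* a, int n) {
  register_with_runtime(&n);
  int iterations = 0;
  for (int i = 0; i < n; ++i) {
    a[i] = 0;
    ++iterations;
  }
  return iterations;
}

// Same loop, but the bound is first copied into a local whose address is
// never taken. Nothing else can refer to trip, so the trip count is fixed.
int fill_snapshot(int* a, int n) {
  register_with_runtime(&n);
  const int trip = n;  // snapshot of the bound
  int iterations = 0;
  for (int i = 0; i < trip; ++i) {
    a[i] = 0;
    ++iterations;
  }
  return iterations;
}
```

Both functions compute the same result; the difference is only in what the compiler is allowed to assume about the loop bound.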

Loop bound variables are a special case and shouldn’t be placed in data regions. Is there a particular reason why you want to put “n” in a data region?

Note that scalars are private by default and by putting them in a data region you are making them global. Unless you are intentionally making them global, I advise letting the compiler implicitly privatize them since it allows for better optimization of the compute kernel.
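
As a minimal sketch of that advice, assuming the array is the only thing that needs to stay resident on the device: the array goes into the data region and n is left out of every data clause. The helper name `zero_array` is hypothetical, and the compiler feedback for this exact code has not been checked. Without an OpenACC compiler the pragmas are ignored and the loop runs on the host.

```cpp
#include <cassert>

// Hypothetical helper: zeroes a[0:n] on the device and returns the sum of
// the array after it has been copied back.
int zero_array(int* a, int n) {
  assert(n >= 0);

  // Only the array is placed in the unstructured data region.
#pragma acc enter data copyin(a[0:n])

  // n appears in no data clause, so the compiler is free to apply its
  // default handling of scalars to the loop bound.
#pragma acc parallel loop present(a[0:n])
  for (int i = 0; i < n; ++i) {
    a[i] = 0;
  }

  // Copy the result back and release the device copy of the array.
#pragma acc exit data copyout(a[0:n])

  int sum = 0;
  for (int i = 0; i < n; ++i) {
    sum += a[i];
  }
  return sum;
}
```

With many loop nests sharing the same bound, the same pattern applies: keep the arrays present across the nests and leave the bound out of the data clauses.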

  • Mat

Hi Mat

I came across this behavior because I wanted to be explicit in terms of data transfer. I had come across a situation where an enum scalar was not properly copied in / updated, so I was afraid this could happen in other places, too. Unfortunately, I was not able to reproduce it in a toy example that I could post.

Are loop bound variables also handled correctly if they reside in a struct, e.g.

struct bounds {
  int nx;
  int ny;
  int nz;
};
bounds max = {2,4,10};
#pragma acc kernels loop
for(int i=0;i< max.nx;++i){
  //some code
}

Thanks,
LS

“Are loop bounds variables also handled correctly if they reside in a struct e.g.”

Yes, this is fine.
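
For reference, a self-contained variant of the struct example might look like the sketch below. The function name `fill_x`, the parameter name `b`, the loop body and the `copyout` clause are assumptions added for illustration; only the `bounds` struct and the use of `nx` as the loop bound come from the question. The compiler output for this code has not been checked.

```cpp
struct bounds {
  int nx;
  int ny;
  int nz;
};

// Hypothetical helper: writes 2*i into a[i] for i in [0, b.nx) and returns
// the number of elements written. The struct is passed by value and is not
// named in any data clause; only the array is.
int fill_x(int* a, bounds b) {
#pragma acc kernels loop copyout(a[0:b.nx])
  for (int i = 0; i < b.nx; ++i) {
    a[i] = 2 * i;
  }
  return b.nx;
}
```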

  • Mat