Hi,
I am confused about the effect of an “acc enter data copyin” directive in combination with a kernels or parallel construct.
This simplistic dummy code:
int n = 5;
int* a = new int[n];
#pragma acc enter data pcopyin(n)
#pragma acc parallel loop pcopyin(n,a[0:n])
for(int i=0;i<n;++i){
a[i] = 0;
}
produces the following compiler output (PGI 15.7-15.10):
145, Generating enter data copyin(n)
Loop without integer trip count will be executed in sequential mode
Generating copyin(n,a[:n])
Accelerator kernel generated
Generating Tesla code
However, if I comment out the “enter data copyin” directive:
int n = 5;
int* a = new int[n];
//#pragma acc enter data pcopyin(n)
#pragma acc parallel loop pcopyin(n,a[0:n])
for(int i=0;i<n;++i){
a[i] = 0;
}
the compiler gives a more pleasing output in terms of parallelization:
145, Generating copyin(n,a[:n])
Accelerator kernel generated
Generating Tesla code
148, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
My understanding is that the two cases should differ only in when the scalar n is copied to the GPU, not in how the loop is parallelized. In the first case n should be copied to the GPU at the enter data directive, so it is already present when the loop is reached; in the second case n should be copied to the GPU at the parallel construct.
Doing the same exercise with the kernels construct results in the following:
int n = 5;
int* a = new int[n];
#pragma acc enter data pcopyin(n)
#pragma acc kernels pcopyin(n,a[0:n])
{
#pragma acc loop independent
for(int i=0;i<n;++i){
a[i] = 0;
}
}
148, Generating enter data copyin(n)
Generating copyin(n,a[:n])
150, Conditional loop will be executed in scalar mode
Accelerator scalar kernel generated
and again, commenting out the enter data copyin:
int n = 5;
int* a = new int[n];
//#pragma acc enter data pcopyin(n)
#pragma acc kernels pcopyin(n,a[0:n])
{
#pragma acc loop independent
for(int i=0;i<n;++i){
a[i] = 0;
}
}
leads to a parallelized kernel:
148, Generating copyin(n,a[:n])
150, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
150, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
If I instead do an enter data copyin of the array, it does not affect parallelization. So it seems that explicit data handling of loop-bound variables is not supported. Leaving the data handling of the variable n to the compiler, i.e. using no explicit pcopyin clause for it, means the value is always copied in from the host, which in most cases is OK. But it might hurt performance if you have many loop nests with the same bounds, since those few bytes have to be transferred to the GPU before every kernel launch.
Thanks,
LS