Getting rid of " size of the GPU copy of an array depe

I think I have a straightforward parallel loop, but the compiler keeps telling me:

Accelerator region ignored
48, Accelerator restriction: size of the GPU copy of an array depends on values computed in this loop
49, Accelerator restriction: size of the GPU copy of ‘myVar’ is unknown
Accelerator restriction: loop contains unsupported statement type
Accelerator restriction: one or more arrays have unknown size
61, Accelerator restriction: unsupported statement type

where myVar is an array used for calculation.
#pragma acc region copyin (bepB[0:MAX-1], mVar[0:MAX-1], Beam[0:MAX-1]) \
                   copyout (dose[0:MAX-1])
for ( in = 0; in < total; in++){
    current = 0;
    ptr = 0;

    for (int i = 0; i < Nb; i++){
        for (int j = 0; j < bepB[i]; j++){

            current = j+ptr;
            calculate(dose, mVar[in], Beam[current] );
        }
    }
}

The arrays are global, but I don’t think that should cause a problem. What am I doing wrong? Why can’t the compiler see the size of myVar even though I’ve declared the arrays with restrict?
Please, any help would be much appreciated.

Hi luiceur,

The basic problem here is that by default the compiler will try to accelerate your inner loops. It’s failing because the bound of the j loop is not known on entry to the region and Beam’s index is computed inside the loop. I’m not sure why it’s complaining about mVar, but the “unsupported statement type” is most likely pointer arithmetic.

What I’d try doing is have the compiler just accelerate the “in” loop. Something like:

#pragma acc region copyin (bepB[0:MAX-1], mVar[0:MAX-1], Beam[0:MAX-1]) \
copyout (dose[0:MAX-1]) 
{
#pragma acc for kernel independent
for ( in = 0; in < total; in++){ 
...

Though it’s difficult to tell exactly what’s going on with just a code snippet. If this doesn’t work, I would appreciate it if you could post a complete example.

- Mat

Now I am getting:

36, Accelerator region ignored
45, Accelerator restriction: invalid loop
48, Accelerator restriction: invalid loop


Here is the complete parallel region:

#pragma acc region copyin (beamletOffAngleArray[0:Nbeamlets-1], \
pSource[0:params.Nbeams-1], pBody[0:TotVoxel-1], \
beamletsPerBeam[0:params.Nbeams-1], pBeamlet[0:Nbeamlets-1],\
inputPointsArray[0:Nbeamlets-1], beamletFluence[0:Nbeamlets-1], \
accumulateDepth1D[0:fSize-1], accumulate_n_DepthArray[0:Nbeamlets-1], \
inputFittingParams[0:params.NPar_depth-1], inputA[0: Asize-1],\
structure[0:TotVoxel-1] ) copyout(dose[0:TotVoxel-1])   
    {
#pragma acc for kernel independent private (currentBeamlet)
        for ( ivoxel = 0; ivoxel < TotVoxel; ivoxel++){
            currentBeamlet = 0;
            ptr=0;
            for (int i = 0; i < params.Nbeams; i++) { 
                for (int j = 0; j < beamletsPerBeam[i]; j++){

                    currentBeamlet = j+ptr;                              
  
                    if(dis_p_foot <= thres) {

                          dose[ivoxel] =dose[ivoxel] + calculation;

                    }//if(dis_p_foot <= thres)

                }// end of for j

                ptr = ptr + beamletsPerBeam[i];

            } //end of for i
            
        }//end of for ivoxels
        
    }//end of #pragma region

Why am I getting those errors?

I’m not sure. Can you please either post a complete example or send one to trs@pgroup.com?

Thanks,
Mat

Done! I have found quite a lot of internal errors compiling with the latest version.

Hi Luis,

I checked with PGI Customer Support and we did get some source code from you, but it doesn’t contain any accelerator regions nor the variables listed in your post. Can you double check that you sent us the correct code?

Also, please include all dependent header files.

Thanks,
Mat

Did you finally get the code? Were you able to reproduce the internal errors?

Hi Luis,

Customer Service hasn’t received any mail from you since sending you the request for additional header files. Also, the code snippet you sent, as well as the one posted here, does not appear anywhere in the two C source files.

Can you try again?

Thanks,
Mat

Hi,

I sent the whole code, a large zip file with inputs included. Should I add a reference to avoid any mix-up in your incoming emails?
We are using PGI Accelerator -> pgcc 11.3-0 64-bit target on x86-64 Linux. I’ll put my name in the header. Luis

Hi Luis,

I got the code, however it works fine for me. I suspect something else is going on. I’ve sent you a note and we can correspond via email till we determine what’s wrong.

Thanks,
Mat

Hi,

Thanks a lot for your note. I’ve noticed that what I sent wasn’t the complete code. I just wanted to send a simple example, but I think it was too simple, and that one also worked for me. I’ve now sent you the code that the compiler is unable to accelerate.
Although it looks like complex code, the loops should be easy to parallelize, since only the outermost one should run in parallel and the others, inside the kernel, should run sequentially.

Please have a quick look and let me know if you spot something.

Best and again, thanks a lot!

Any luck with that? Have you been able to reproduce the errors?

Hi,
Has the GPU read/write loop dependence issue been resolved in this example code? I am having a similar problem and would like to see the solution. If there are other articles on solving this issue from a general standpoint, those would be useful too.

Thanks

Avi

Hi Avi,

Has the GPU read/write loop dependence issue been resolved in this example code?

luiceur’s issue had more to do with the use of structs, and was fixed in the 1.9 compiler.

In general, a read/write loop dependence is often reported when there is a potential for pointers to overlap. Users must either explicitly declare the pointers with the “restrict” qualifier or compile with the “-Msafeptr” flag. This is true for acceleration as well as auto-vectorization and auto-parallelization.
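To make the “restrict” option concrete, here is a minimal sketch in plain C99. The function scale_add and its parameters are made up for illustration and do not come from either poster's code. The qualifier is a promise from the programmer that the two pointers never refer to overlapping memory; the compiler does not check it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the thread: y[i] += a * x[i].
 * "restrict" promises that x and y never overlap, so the compiler may
 * treat the iterations as independent of each other. Breaking that
 * promise is undefined behavior. */
void scale_add(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}
```

Calling it with two separate arrays is fine; passing the same buffer (or overlapping slices of one buffer) as both x and y would break the promise.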

Can you be more specific as to the issue you are encountering?

Thanks,
Mat

Mat,
here is my problem…

// start of code snippet…

#pragma acc region
{
    for (i=0;i < npart;i++) {
        ax = x[i]*dh_inv;
        ay = y[i]*dh_inv;
        az = z[i]*dh_inv;
        ix0 = ax - dn;
        iy0 = ay - dn;
        iz0 = az - dn;
        ix1 = ax + dn;
        iy1 = ay + dn;
        iz1 = az + dn;
        for (iz=iz0;iz <= iz1;iz++) {
            for (iy=iy0;iy <= iy1;iy++) {
                for (ix=ix0;ix <= ix1;ix++) pos[ix + iy*nx + iz*nxy]++;
            }
        }
    }
} // end of acc region

// … end of code snippet

The size of array pos is not defined until the outer loop parameters are set. So I get the “Accelerator restriction: size of the GPU copy of an array depends on values computed in this loop” message. So I assume I need to do a copyin of pos array with the full or a fixed size that was malloc’d.

  1. My first question is where exactly I need to do it: at the “i” loop where the pragma is set, or elsewhere?
  2. Do I need to move the pragma acc definition to the inner loops, i.e. before the “iz” loop, or will the dependence prevent any parallelization or vectorization? Or are there other clauses that can be set on the pragma to work around this dependence and allow vectorization within a thread block?

Thanks

Avi

Hi Avi,

The size of the loops needs to be known when entering a compute region. Hence you either need to accelerate just the outer “i” loop or the inner three loops. Which is best will depend on the sizes of the loops.

For option #1, use the “kernel” loop directive clause to tell the compiler that you want the body of the outer loop to be the compute kernel. You’ll also need to use the “independent” clause since the compiler won’t be able to tell that all calculations on pos are independent. Of course, if multiple iterations of “i” do modify the same element in pos, then you will get wrong answers, so only use “independent” if you are sure. Finally, you may need to use the copyout clause for pos since the compiler won’t be able to determine the size of pos given the information from the loop.

The code would look something like:

// Note: pos[...]++ reads pos as well as writing it, so "copy" may be
// needed instead of "copyout" if the initial host values matter.
#pragma acc region copyout (pos[0:posSize-1])
{
#pragma acc for kernel, independent
    for (i=0;i < npart;i++) {
        ax = x[i]*dh_inv;
        ay = y[i]*dh_inv;
        az = z[i]*dh_inv;
        ix0 = ax - dn;
        iy0 = ay - dn;
        iz0 = az - dn;
        ix1 = ax + dn;
        iy1 = ay + dn;
        iz1 = az + dn;
        for (iz=iz0;iz <= iz1;iz++) {
            for (iy=iy0;iy <= iy1;iy++) {
                for (ix=ix0;ix <= ix1;ix++) pos[ix + iy*nx + iz*nxy]++;
            }
        }
    }
} // end of acc region
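
If you are not sure whether “independent” is safe on the “i” loop, a host-side check along the following lines may help. This is only a sketch under assumptions that are not in the posted code: each particle has an integer cell center (the arrays cx, cy, cz are hypothetical), it touches the cube from center - dn to center + dn in each dimension, and every cube lies fully inside the grid so that distinct (ix, iy, iz) triples map to distinct elements of pos.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper, not from the thread. The intervals [a-dn, a+dn] and
 * [b-dn, b+dn] intersect exactly when |a-b| <= 2*dn, so two cubes share a
 * cell when that holds in all three dimensions. */
static bool cubes_overlap(int ax, int ay, int az,
                          int bx, int by, int bz, int dn)
{
    return abs(ax - bx) <= 2 * dn &&
           abs(ay - by) <= 2 * dn &&
           abs(az - bz) <= 2 * dn;
}

/* Returns true when no two particles update a common cell, i.e. when the
 * iterations of the outer loop write disjoint parts of pos.
 * Brute force, O(npart^2): meant as a debugging aid, not production code. */
bool updates_are_disjoint(const int *cx, const int *cy, const int *cz,
                          int npart, int dn)
{
    assert(dn >= 0);
    for (int a = 0; a < npart; a++) {
        for (int b = a + 1; b < npart; b++) {
            if (cubes_overlap(cx[a], cy[a], cz[a],
                              cx[b], cy[b], cz[b], dn)) {
                return false;
            }
        }
    }
    return true;
}
```

If this returns false for your data, two iterations of “i” increment the same element of pos and the “independent” version can give wrong counts.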

For option #2, you would push the compute region around the inner loops. Since pos uses a computed index, you’ll need the independent clause again. I have put in a request that the compiler should be able to figure out that all values of a computed index are independent when only constant variables and loop indices are used in the expression. It’s quite a difficult analysis for the compiler to do, but the pattern is common, so our engineers are working on it. Finally, I’d add a data region around the “i” loop so pos is copied only once.


#pragma acc data region copyout(pos[0:posSize-1])
{
    for (i=0;i < npart;i++) {
        ax = x[i]*dh_inv;
        ay = y[i]*dh_inv;
        az = z[i]*dh_inv;
        ix0 = ax - dn;
        iy0 = ay - dn;
        iz0 = az - dn;
        ix1 = ax + dn;
        iy1 = ay + dn;
        iz1 = az + dn;
#pragma acc region
        {
#pragma acc for independent
            for (iz=iz0;iz <= iz1;iz++) {
#pragma acc for independent
                for (iy=iy0;iy <= iy1;iy++) {
#pragma acc for independent
                    for (ix=ix0;ix <= ix1;ix++) pos[ix + iy*nx + iz*nxy]++;
                }
            }
        } // end of acc region
    }
} // end of acc data region
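
Both versions use posSize in the bounds pos[0:posSize-1]. I don't know how pos is allocated in your code, so the following is only a sketch: it assumes pos covers a full nx by ny by nz grid and that nxy equals nx*ny. The names ny and nz, and both helper functions, are assumptions of mine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, not from the thread. */

/* Flattened index used in the loops: ix + iy*nx + iz*nxy, with nxy = nx*ny. */
size_t flat_index(size_t ix, size_t iy, size_t iz, size_t nx, size_t ny)
{
    assert(ix < nx && iy < ny);
    return ix + iy * nx + iz * (nx * ny);
}

/* Number of elements in pos under the full-grid assumption. The last valid
 * index is grid_size(nx, ny, nz) - 1, which is the upper bound written as
 * posSize-1 in the directives. */
size_t grid_size(size_t nx, size_t ny, size_t nz)
{
    return nx * ny * nz;
}
```

With nx = 4, ny = 3, nz = 2, the grid has 24 elements and the last cell (3, 2, 1) maps to index 23.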

Hope this helps,
Mat

Hi Mat,
thanks for both your suggestions.

Without going into details, neither of them worked quite right, with different issues in each case. Since it’s now getting into code nitty-gritty, should I continue to post here, create a separate (new) topic, or take it offline and iterate via email?

Once I know your preference, I’ll act accordingly.

Thanks for your help.

– Avi

Hi Avi,

If the code is big or proprietary, let’s take it offline. Send the code to PGI customer service (trs@pgroup.com) and ask them to send it to me.

If you have a small example, the User Forum is fine. Though, please start a new thread.

Note that our lab is shut down from Dec 23rd to Jan 3rd, but I’ll check in a few times during the week.

- Mat

Mat,
I sent the source tarball with compile, run outputs for the 2 options. Let me know if you received it.

Thanks

Avi

Got it. I’ll take a look later today.

- Mat