I think I have a straightforward parallel loop, but the compiler keeps telling me:
Accelerator region ignored
48, Accelerator restriction: size of the GPU copy of an array depends on values computed in this loop
49, Accelerator restriction: size of the GPU copy of ‘myVar’ is unknown
Accelerator restriction: loop contains unsupported statement type
Accelerator restriction: one or more arrays have unknown size
61, Accelerator restriction: unsupported statement type
where myVar is an array used in the calculation.

#pragma acc region copyin (bepB[0:MAX-1], mVar[0:MAX-1], Beam[0:MAX-1]) \
                   copyout (dose[0:MAX-1])
for (in = 0; in < total; in++) {
    current = 0;
    ptr = 0;
    for (int i = 0; i < Nb; i++) {
        for (int j = 0; j < bepB_; j++) {
            current = j + ptr;
            calculate(dose, mVar[in], Beam[current]);
        }
    }
}
The arrays are global, but I don’t think that should cause a problem. What am I doing wrong? Why can the compiler not see the size of myVar even though I’ve declared it with *restrict?
Please, any help would be much appreciated.
The basic problem here is that by default the compiler will try to accelerate your inner loops. It’s failing because the j loop’s bound is not known and Beam’s index is computed. I’m not sure why it’s complaining about mVar, but the “unsupported statement type” is most likely the pointer arithmetic.
What I’d try is having the compiler accelerate just the “in” loop. Something like:
#pragma acc region copyin (bepB[0:MAX-1], mVar[0:MAX-1], Beam[0:MAX-1]) \
                   copyout (dose[0:MAX-1])
{
    #pragma acc for kernel independent
    for (in = 0; in < total; in++) {
        ...
Though, it’s difficult to tell exactly what’s going on from just a code snippet. If this doesn’t work and you could post a complete example, I would appreciate it.
I checked with PGI Customer Support and we did get some source code from you, but it doesn’t contain any accelerator regions nor the variables listed in your post. Can you double check that you sent us the correct code?
Customer Service hasn’t received any mail from you since sending you the request for additional header files. Also, the code snippet you sent, as well as the one posted here, does not appear anywhere in the two C source files.
I sent the whole code, a large zip file with the inputs included. Shall I add a reference line to avoid any mix-up in your inbox?
We are using the PGI Accelerator, pgcc 11.3-0, 64-bit target on x86-64 Linux. I’ll put my name in the header. Luis
I got the code; however, it works fine for me. I suspect something else is going on. I’ve sent you a note, and we can correspond via email until we determine what’s wrong.
Thanks a lot for your note. I’ve noticed that wasn’t the complete code, as I just wanted to send a simple example, but I think it was too simple. That one also worked for me. I’ve sent you the code that the compiler is finding impossible to handle.
Although it seems to be a complex code, the loops should be easy to parallelize: only the first one needs to run in parallel, and the others within the kernel should run sequentially.
Please have a quick look and let me know if you spot something.
Hi,
Has this issue of GPU read/write loop dependence been resolved in this example code? I am having a similar problem and would like to see the solution. If there are other articles on solving this issue from a general standpoint, that would be useful too.
Has this issue of GPU read/write loop dependence been resolved in this example code?
luiceur’s issue had more to do with the use of structs, and was fixed in the 1.9 compiler.
In general, a read/write loop dependence is often reported when there is a potential that pointers overlap. Users must either explicitly declare pointers with the “restrict” attribute or compile with the flag “-Msafeptr”. This is true for the Accelerator as well as for auto-vectorization and auto-parallelization.
Can you be more specific as to the issue you are encountering?
#pragma acc region
{
    for (i = 0; i < npart; i++) {
        ax = x[i]*dh_inv;
        ay = y[i]*dh_inv;
        az = z[i]*dh_inv;
        ix0 = ax - dn;
        iy0 = ay - dn;
        iz0 = az - dn;
        ix1 = ax + dn;
        iy1 = ay + dn;
        iz1 = az + dn;
        for (iz = iz0; iz <= iz1; iz++) {
            for (iy = iy0; iy <= iy1; iy++) {
                for (ix = ix0; ix <= ix1; ix++) pos[ix + iy*nx + iz*nxy]++;
            }
        }
    }
} // end of acc region
// ... end of code snippet
The size of the array pos is not known until the outer loop’s values are computed. So I get the “Accelerator restriction: size of the GPU copy of an array depends on values computed in this loop” message. So I assume I need to do a copyin of the pos array with the full or a fixed size that was malloc’d.
My first question is where exactly do I need to do it – at the “i” loop where the pragma is set, or elsewhere?
Do I need to move the pragma acc definition to the inner loops i.e. before the “iz” loop, or will the dependence not allow any parallelization or vectorization? Or are there any other definitions that can be set with the pragma calls that will allow to work around this dependence to allow for vectorization within a thread block?
The size of the loops needs to be known when entering a compute region. Hence you either need to accelerate just the outer “i” loop or the inner three loops. Which is best will depend on the sizes of the loops.
For option #1, use the “kernel” loop directive clause to tell the compiler that you want the body of the outer loop to be the compute kernel. You’ll also need the “independent” clause, since the compiler won’t be able to tell that all calculations on pos are independent. Of course, if multiple iterations of “i” do modify the same element of pos, then you will get wrong answers, so only use “independent” if you are sure. Finally, you may need a copyout clause for pos, since the compiler won’t be able to determine the size of pos from the information in the loop.
The code would look something like:
#pragma acc region copyout (pos[0:posSize-1])
{
    #pragma acc for kernel, independent
    for (i = 0; i < npart; i++) {
        ax = x[i]*dh_inv;
        ay = y[i]*dh_inv;
        az = z[i]*dh_inv;
        ix0 = ax - dn;
        iy0 = ay - dn;
        iz0 = az - dn;
        ix1 = ax + dn;
        iy1 = ay + dn;
        iz1 = az + dn;
        for (iz = iz0; iz <= iz1; iz++) {
            for (iy = iy0; iy <= iy1; iy++) {
                for (ix = ix0; ix <= ix1; ix++) pos[ix + iy*nx + iz*nxy]++;
            }
        }
    }
} // end of acc region
For option #2, you would push the compute region down around the inner loops. Since pos uses a computed index, you’ll need the independent clause again. I have put in a request that the compiler be able to figure out that all values of a computed index are independent when only constant variables and loop indices appear in the expression. It’s quite a difficult analysis for the compiler, but the pattern is quite common, so our engineers are working on it. Finally, I’d add a data region around the “i” loop so pos is copied only once.
#pragma acc data region copyout(pos[0:posSize-1])
{
    for (i = 0; i < npart; i++) {
        ax = x[i]*dh_inv;
        ay = y[i]*dh_inv;
        az = z[i]*dh_inv;
        ix0 = ax - dn;
        iy0 = ay - dn;
        iz0 = az - dn;
        ix1 = ax + dn;
        iy1 = ay + dn;
        iz1 = az + dn;
        #pragma acc region
        {
            #pragma acc for independent
            for (iz = iz0; iz <= iz1; iz++) {
                #pragma acc for independent
                for (iy = iy0; iy <= iy1; iy++) {
                    #pragma acc for independent
                    for (ix = ix0; ix <= ix1; ix++) pos[ix + iy*nx + iz*nxy]++;
                }
            }
        }
    }
} // end of acc data region
Without going into details, neither of them worked quite right, but with different issues in each case. Since it’s now getting into code nitty-gritty, logistically, should I continue to post here, create a separate (new) topic, or take it offline and iterate via email?
Once I know your preference, I’ll act accordingly.