I think I have a straightforward parallel loop, but the compiler keeps telling me:
Accelerator region ignored
48, Accelerator restriction: size of the GPU copy of an array depends on values computed in this loop
49, Accelerator restriction: size of the GPU copy of ‘myVar’ is unknown
Accelerator restriction: loop contains unsupported statement type
Accelerator restriction: one or more arrays have unknown size
61, Accelerator restriction: unsupported statement type
where myVar is an array used in the calculation.

#pragma acc region copyin (bepB[0:MAX-1], mVar[0:MAX-1], Beam[0:MAX-1]) \
                   copyout (dose[0:MAX-1])
for (in = 0; in < total; in++) {
    current = 0;
    ptr = 0;
    for (int i = 0; i < Nb; i++) {
        for (int j = 0; j < bepB_; j++) {
            current = j + ptr;
            calculate(dose, mVar[in], Beam[current]);
        }
    }
}
The arrays are global, but I don’t think that should cause a problem. What am I doing wrong? Why can the compiler not see the size of myVar even though I’ve declared it with *restrict?
Please, any help would be much appreciated.
The basic problem here is that by default the compiler will try to accelerate your inner loops. It’s failing because the j loop’s bound is not known and Beam’s index is computed. I’m not sure why it’s complaining about mVar, but the “unsupported statement type” is most likely the pointer arithmetic.
What I’d try is having the compiler accelerate just the “in” loop. Something like:
#pragma acc region copyin (bepB[0:MAX-1], mVar[0:MAX-1], Beam[0:MAX-1]) \
                   copyout (dose[0:MAX-1])
{
    #pragma acc for kernel independent
    for (in = 0; in < total; in++) {
        ...
Though, it’s difficult to tell exactly what’s going on from just a code snippet. If this doesn’t work and you could post a complete example, I would appreciate it.
I checked with PGI Customer Support and we did get some source code from you, but it doesn’t contain any accelerator regions nor the variables listed in your post. Can you double check that you sent us the correct code?
Customer Service hasn’t received any mail from you since sending you the request for additional header files. Also, the code snippet you sent, as well as the one posted here, does not appear anywhere in the two C source files.
I sent the whole code, a large zip file with the inputs included. Shall I add a reference line to avoid any mix-up in your inbox?
We are using the PGI Accelerator, pgcc 11.3-0, 64-bit target on x86-64 Linux. I’ll put my name in the header. Luis
I got the code; however, it works fine for me. I suspect something else is going on. I’ve sent you a note, and we can correspond via email until we determine what’s wrong.
Thanks a lot for your note. I’ve noticed that wasn’t the complete code, as I just wanted to send a simple example, but I think it was too simple. That one also worked for me. I’ve sent you the code that the compiler is finding impossible to handle.
Although it seems to be a complex code, the loops should be easy to parallelize: only the first one needs to run in parallel, and the others within the kernel should run sequentially.
Please have a quick look and let me know if you spot something.
Hi,
Has this issue of GPU read/write loop dependence been resolved in this example code? I am having a similar problem and would like to see the solution. If there are other articles on solving this issue from a general standpoint, that would be useful too.
Has this issue of GPU read/write loop dependence been resolved in this example code?
luiceur’s issue had more to do with the use of structs, and was fixed in the 1.9 compiler.
In general, a read/write loop dependence is often reported when there is a potential that pointers overlap. Users must either explicitly declare pointers with the “restrict” attribute or compile with the flag “-Msafeptr”. This is true for the Accelerator as well as for auto-vectorization and auto-parallelization.
Can you be more specific as to the issue you are encountering?
#pragma acc region
{
    for (i = 0; i < npart; i++) {
        ax = x[i]*dh_inv;
        ay = y[i]*dh_inv;
        az = z[i]*dh_inv;
        ix0 = ax - dn;
        iy0 = ay - dn;
        iz0 = az - dn;
        ix1 = ax + dn;
        iy1 = ay + dn;
        iz1 = az + dn;
        for (iz = iz0; iz <= iz1; iz++) {
            for (iy = iy0; iy <= iy1; iy++) {
                for (ix = ix0; ix <= ix1; ix++) pos[ix + iy*nx + iz*nxy]++;
            }
        }
    }
} // end of acc region
// ... end of code snippet
The size of the array pos is not known until the outer loop’s values are computed. So I get the “Accelerator restriction: size of the GPU copy of an array depends on values computed in this loop” message. So I assume I need to do a copyin of the pos array with the full or a fixed size that was malloc’d.
My first question is where exactly do I need to do it – at the “i” loop where the pragma is set, or elsewhere?
Do I need to move the pragma acc definition to the inner loops i.e. before the “iz” loop, or will the dependence not allow any parallelization or vectorization? Or are there any other definitions that can be set with the pragma calls that will allow to work around this dependence to allow for vectorization within a thread block?
The size of the loops needs to be known when entering a compute region. Hence you either need to accelerate just the outer “i” loop or the inner three loops. Which is best will depend on the sizes of the loops.
For option #1, use the “kernel” loop directive clause to tell the compiler that you want the body of the outer loop to be the compute kernel. You’ll also need the “independent” clause, since the compiler won’t be able to tell that all calculations on pos are independent. Of course, if multiple iterations of “i” do modify the same element of pos, then you will get wrong answers, so only use “independent” if you are sure. Finally, you may need a copyout clause for pos, since the compiler won’t be able to determine the size of pos from the information in the loop.
The code would look something like:
#pragma acc region copyout (pos[0:posSize-1])
{
    #pragma acc for kernel, independent
    for (i = 0; i < npart; i++) {
        ax = x[i]*dh_inv;
        ay = y[i]*dh_inv;
        az = z[i]*dh_inv;
        ix0 = ax - dn;
        iy0 = ay - dn;
        iz0 = az - dn;
        ix1 = ax + dn;
        iy1 = ay + dn;
        iz1 = az + dn;
        for (iz = iz0; iz <= iz1; iz++) {
            for (iy = iy0; iy <= iy1; iy++) {
                for (ix = ix0; ix <= ix1; ix++) pos[ix + iy*nx + iz*nxy]++;
            }
        }
    }
} // end of acc region
For option #2, you would push the compute region down around the inner loops. Since pos uses a computed index, you’ll need the independent clause again. I have put in a request that the compiler be able to figure out that all values of a computed index are independent when only constant variables and loop indices appear in the expression. It’s quite a difficult analysis for the compiler, but the pattern is quite common, so our engineers are working on it. Finally, I’d add a data region around the “i” loop so pos is copied only once.
#pragma acc data region copyout(pos[0:posSize-1])
{
    for (i = 0; i < npart; i++) {
        ax = x[i]*dh_inv;
        ay = y[i]*dh_inv;
        az = z[i]*dh_inv;
        ix0 = ax - dn;
        iy0 = ay - dn;
        iz0 = az - dn;
        ix1 = ax + dn;
        iy1 = ay + dn;
        iz1 = az + dn;
        #pragma acc region
        {
            #pragma acc for independent
            for (iz = iz0; iz <= iz1; iz++) {
                #pragma acc for independent
                for (iy = iy0; iy <= iy1; iy++) {
                    #pragma acc for independent
                    for (ix = ix0; ix <= ix1; ix++) pos[ix + iy*nx + iz*nxy]++;
                }
            }
        }
    }
} // end of acc data region
Without going into details, neither of them worked quite right, but with different issues in each case. Since it’s now getting into code nitty-gritty, logistically, should I continue to post here, create a separate (new) topic, or take it offline and iterate via email?
Once I know your preference, I’ll act accordingly.