Couple of questions (nested loops, loop bounds, etc.)

Hello,

I've been trying to parallelize my scientific code using OpenACC. I had assumed that using OpenACC would be very easy, but mapping it onto my code has proved quite difficult. It is certainly nice on simple examples, but not so easy on larger nested loops with several arrays.

Anyway, I have the following rather long questions (excuse me for the length):

  1. What is the best way to parallelize the following loop?
   

long cnti, cntj, cntk;
double psi2, psi2lin, tmp;

#pragma acc parallel loop collapse(3) //private (cnti, cntj, cntk, psi2, psi2lin, tmp)
for(cnti = 0; cnti < NXDUMMY; cnti ++) {
     for(cntj = 0; cntj < NYDUMMY; cntj ++) {
         for(cntk = 0; cntk < NZDUMMY; cntk ++) {
            psi2 = psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk];
            psi2lin = psi2 * G;
            tmp = dt * (pot[cnti][cntj][cntk] + psi2lin);
            psi[cnti][cntj][cntk] *= exp(-tmp);
	  } 
     }
}

As you can see, I have done it by inserting the collapse clause in the pragma. However, to be honest, I arrived at that by trying out different solutions, most of which did not work. Some are from this forum, some are from the scarce tutorials on OpenACC.

This one works, but I'm not sure why, or whether it is the best solution. What does collapse exactly do there? I'm a native OpenMP programmer, and I only know how to parallelize the outermost loop into threads while the inner loops are done sequentially in each thread. Here I'm not sure what is happening with the inner loops - are they divided among threads, workers, or gangs, whatever you call them? I'm not a CUDA programmer, so those terms are vague to me. I know that the outermost loop should be divided among gangs, right? But what is happening inside, and how is it synchronized?

On the other hand, if I don't put the collapse clause there, the compiler reports loop-carried dependences on several variables (psi and pot, for example), although logically the work should be divided at the first loop (cnti) among gangs and shouldn't have any loop dependencies from then on, right? cntj and cntk should be independent for each gang, right? Also, commenting or uncommenting those private variables seems to make no difference to the end result (which is correct compared to the original OpenMP code), although from what I have been taught, all variables that have been declared on the host (psi2 and tmp for example) or copied to the device are shared by default unless declared private, right? How come the result is correct if all the variables are shared between the gangs?

Keep in mind that I have copied the psi and pot arrays to the device prior to executing this code, so data movement is not an issue here.

  2. As you saw, I put NXDUMMY, NYDUMMY and NZDUMMY as the array bounds in my loops. Those are constant values defined by a #define directive at the beginning of my program. If I don't do that, and instead use the Nx, Ny and Nz variables that are read from an input file at runtime, the program returns the following error:

call to clSetKernelArg returned error -51: invalid argument size

Keep in mind that the accelerator I'm using is a Radeon R9 280X, so under the hood it is using OpenCL and not CUDA.

Does this mean that I have to define my array sizes at compile time (via #define) and not at runtime? Bear in mind that my arrays are dynamically allocated with malloc and then copied to the device.

  3. When I start my program, it runs the previous code inside a function call (the previous code is basically a function with psi as its only parameter). Nothing else happens in this program besides array allocation, initialization, deinitialization, and the previous function being called, say, 1000 times in a loop. So basically that is the only real work going on there. However, when I watch the system monitor, all 6 cores of my processor are 100% busy. I don't understand why. The arrays are copied to the device before the 1000 iterations and copied back after them, the device should be doing the loops, and the only job the host has is the 1000 invocations of the function, while the function body itself is executed on the device. The only thing I can think of is that invoking
#pragma acc parallel loop collapse(3)

1000 times is producing the full processor usage, although I think that is highly unlikely. Do you have any clue why this happens?

As an end note, 1000 iterations in the OpenACC variant run about 2 times faster than the OpenMP variant, at least for this single function. It is a nice result, although I believe it can be even better.

Thank you in advance.

Hi timek28,

Lots of questions here, but let’s try and take them in turn.

What does collapse exactly do there?

This works similarly to OpenMP's collapse. The body of the inner loop becomes your kernel, with the three loops collapsed into a single loop that is parallelized across the gangs and vectors. Each device thread would compute one element of "psi[cnti][cntj][cntk]".
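Conceptually (this is just an illustrative sketch, not the actual generated code), collapse(3) behaves as if the nest were flattened into one iteration space and that single space were then split across gangs and vector lanes:

/* Conceptual sketch only: one flat iteration space of
   NXDUMMY*NYDUMMY*NZDUMMY elements, recovered indices per iteration. */
long n = (long)NXDUMMY * NYDUMMY * NZDUMMY;
#pragma acc parallel loop
for (long idx = 0; idx < n; idx++) {
   long cnti = idx / ((long)NYDUMMY * NZDUMMY);
   long cntj = (idx / NZDUMMY) % NYDUMMY;
   long cntk = idx % NZDUMMY;
   psi[cnti][cntj][cntk] *= exp(-dt * (pot[cnti][cntj][cntk]
                        + G * psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk]));
}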

On the other hand, if I don't put the collapse clause there, the compiler reports loop-carried dependences on several variables (psi and pot, for example), although logically the work should be divided at the first loop (cnti) among gangs and shouldn't have any loop dependencies from then on, right?

The dependency is most likely due to pointer aliasing. The compiler can't tell whether "psi" and "pot" really point to the same memory, and therefore must assume they might. To fix this, either use the compiler flag "-Msafeptr" to assert to the compiler that all pointers are disjoint, or use the C99 "restrict" qualifier when you declare your pointers (i.e. double *** restrict psi, double *** restrict pot).

For "kernels" compute regions, you can also use the "loop independent" directive to assert that the loop is parallel and that the compiler can skip its dependency analysis. "independent" is the default for "loop" directives in "parallel" compute regions. Since you are using "parallel", I suspect you get these messages only when you don't have "#pragma acc loop" before each loop.
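For illustration (my sketch, not code from your post), the "kernels" form of the same nest with an explicit assertion would look like this:

#pragma acc kernels loop independent collapse(3)
for(cnti = 0; cnti < NXDUMMY; cnti ++) {
   for(cntj = 0; cntj < NYDUMMY; cntj ++) {
      for(cntk = 0; cntk < NZDUMMY; cntk ++) {
         /* "independent" tells the compiler to trust that iterations
            don't conflict, so no dependency analysis is needed. */
         psi[cnti][cntj][cntk] *= exp(-dt * (pot[cnti][cntj][cntk]
              + G * psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk]));
      }
   }
}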

Let’s look at each of these cases:

Case #1, using collapse, doesn't hit the dependency issue because the "loop" is applied to all three loops, and "parallel loop" (as well as "kernels loop independent") asserts to the compiler that the loop is parallel, so no dependency analysis is needed.

% cat test1.c
#include <accelmath.h>
int foo(double *** pot, double *** psi, int NYDUMMY, int NZDUMMY, int NXDUMMY, double dt, double G)
{

long cnti, cntj, cntk;
double psi2, psi2lin, tmp;

#pragma acc parallel loop collapse(3) //private (cnti, cntj, cntk, psi2, psi2lin, tmp)
for(cnti = 0; cnti < NXDUMMY; cnti ++) {
     for(cntj = 0; cntj < NYDUMMY; cntj ++) {
         for(cntk = 0; cntk < NZDUMMY; cntk ++) {
            psi2 = psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk];
            psi2lin = psi2 * G;
            tmp = dt * (pot[cnti][cntj][cntk] + psi2lin);
            psi[cnti][cntj][cntk] *= exp(-tmp);
     }
     }
}
}
% pgcc -c test1.c -Minfo=accel -acc -ta=radeon -V14.9
foo:
      8, Accelerator kernel generated
          9, #pragma acc loop gang, vector(256) collapse(3) /* global dim(0) local dim(0) */
         10,   /* global dim(0) local dim(0) collapsed */
         11,   /* global dim(0) local dim(0) collapsed */
      8, Generating present_or_copyin(pot[:NXDUMMY][:NYDUMMY][:NZDUMMY])
         Generating present_or_copy(psi[:NXDUMMY][:NYDUMMY][:NZDUMMY])
         Generating Radeon code

Case #2, removing "collapse" and not adding a "loop" directive to each loop: the compiler will by default try to automatically parallelize the inner loops, but must perform loop dependency analysis in order to do so. Note that the outer loop does get parallelized.

% cat test2.c
#include <accelmath.h>
int foo(double *** pot, double *** psi, int NYDUMMY, int NZDUMMY, int NXDUMMY, double dt, double G)
{

long cnti, cntj, cntk;
double psi2, psi2lin, tmp;

#pragma acc parallel loop  //private (cnti, cntj, cntk, psi2, psi2lin, tmp)
for(cnti = 0; cnti < NXDUMMY; cnti ++) {
     for(cntj = 0; cntj < NYDUMMY; cntj ++) {
         for(cntk = 0; cntk < NZDUMMY; cntk ++) {
            psi2 = psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk];
            psi2lin = psi2 * G;
            tmp = dt * (pot[cnti][cntj][cntk] + psi2lin);
            psi[cnti][cntj][cntk] *= exp(-tmp);
     }
     }
}
}
% pgcc -c test2.c -Minfo=accel -acc -ta=radeon -V14.9
foo:
      8, Accelerator kernel generated
          9, #pragma acc loop gang, vector(256) /* global dim(0) local dim(0) */
      8, Generating present_or_copyin(pot[:NXDUMMY][:NYDUMMY][:NZDUMMY])
         Generating present_or_copy(psi[:NXDUMMY][:NYDUMMY][:NZDUMMY])
         Generating Radeon code
     10, Complex loop carried dependence of 'pot->->->' prevents parallelization
         Complex loop carried dependence of 'psi->->->' prevents parallelization
     11, Complex loop carried dependence of 'pot->->->' prevents parallelization
         Complex loop carried dependence of 'psi->->->' prevents parallelization

Adding “-Msafeptr” or the “restrict” attribute allows the dependency analysis to succeed.

% pgcc -c test2.c -Minfo=accel -acc -ta=radeon -V14.9 -Msafeptr
foo:
      8, Accelerator kernel generated
          9, #pragma acc loop gang /* global dim(0) */
         11, #pragma acc loop vector(256) /* local dim(0) */
      8, Generating present_or_copyin(pot[:NXDUMMY][:NYDUMMY][:NZDUMMY])
         Generating present_or_copy(psi[:NXDUMMY][:NYDUMMY][:NZDUMMY])
         Generating Radeon code
     10, Loop is parallelizable
     11, Loop is parallelizable
% pgcc -c test2r.c -Minfo=accel -acc -ta=radeon -V14.9
foo:
      8, Accelerator kernel generated
          9, #pragma acc loop gang /* global dim(0) */
         11, #pragma acc loop vector(256) /* local dim(0) */
      8, Generating present_or_copyin(pot[:NXDUMMY][:NYDUMMY][:NZDUMMY])
         Generating present_or_copy(psi[:NXDUMMY][:NYDUMMY][:NZDUMMY])
         Generating Radeon code
     10, Loop is parallelizable
     11, Loop is parallelizable

Case #3, add the "loop" directive to each loop.

% cat test3.c
#include <accelmath.h>
int foo(double *** restrict pot, double *** restrict psi, int NYDUMMY,
        int NZDUMMY, int NXDUMMY, double dt, double G)
{

long cnti, cntj, cntk;
double psi2, psi2lin, tmp;

#pragma acc parallel loop gang
for(cnti = 0; cnti < NXDUMMY; cnti ++) {
#pragma acc loop worker
     for(cntj = 0; cntj < NYDUMMY; cntj ++) {
#pragma acc loop vector
         for(cntk = 0; cntk < NZDUMMY; cntk ++) {
            psi2 = psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk];
            psi2lin = psi2 * G;
            tmp = dt * (pot[cnti][cntj][cntk] + psi2lin);
            psi[cnti][cntj][cntk] *= exp(-tmp);
}}}}
% pgcc -c test3.c -Minfo=accel -acc -ta=radeon -V14.9
foo:
      9, Accelerator kernel generated
         10, #pragma acc loop gang /* global dim(0) */
         12, #pragma acc loop worker(4) /* local dim(1) */
         14, #pragma acc loop vector(32) /* local dim(0) */
      9, Generating present_or_copyin(pot[:NXDUMMY][:NYDUMMY][:NZDUMMY])
         Generating present_or_copy(psi[:NXDUMMY][:NYDUMMY][:NZDUMMY])
         Generating Radeon code
     12, Loop is parallelizable
     14, Loop is parallelizable

Now to your initial question:

What is the best way to parallelize the following loop?

It will depend upon the size of the loops. Use collapse when the loop lengths are fairly small, in order to increase the parallelism. If the outer loop is large and the inner loops are small, use a "gang" schedule on the outermost loop and collapse the inner loops. Otherwise, using three separate loop schedules is probably best. Some experimentation may be required, though. The only firm rule is that the stride-1 dimension should be in a "vector" loop (in C, which is row-major, this is the rightmost dimension). This allows memory accesses to be coalesced across threads.
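For instance, the "large outer loop, small inner loops" case could be scheduled something like this (a sketch reusing the loop from above; tune the clauses for your actual sizes):

#pragma acc parallel loop gang
for(cnti = 0; cnti < NXDUMMY; cnti ++) {
   /* the two small inner loops are collapsed onto the vector dimension */
   #pragma acc loop vector collapse(2)
   for(cntj = 0; cntj < NYDUMMY; cntj ++) {
      for(cntk = 0; cntk < NZDUMMY; cntk ++) {
         psi[cnti][cntj][cntk] *= exp(-dt * (pot[cnti][cntj][cntk]
              + G * psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk]));
      }
   }
}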


Issue #2

Does this mean that I have to define my array sizes at compile time (via #define) and not at runtime?

No; as you can see in my examples, the loop bounds are all passed-in values. Given the error, I suspect that there's an issue with the values being used.

How are the loop bound variables declared? Are any of the values negative? Could the product of the values be larger than what fits in an int?
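If it helps, a quick host-side sanity check along these lines (just a sketch; it assumes Nx, Ny and Nz are ints read at runtime) can rule out bad values before any kernels launch:

#include <limits.h>
#include <stdio.h>

/* Sketch: warn if the runtime bounds look suspicious. */
static void check_bounds(int Nx, int Ny, int Nz)
{
   long total = (long)Nx * (long)Ny * (long)Nz;
   if (Nx <= 0 || Ny <= 0 || Nz <= 0 || total > INT_MAX)
      fprintf(stderr, "suspicious loop bounds: %d %d %d\n", Nx, Ny, Nz);
}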

Issue #3: All 6 cores at 100%

I'm not sure about this one since it doesn't quite make sense. Even if the generated OpenACC code were running on the host, it would run sequentially and only use one core. Can you post, or send to PGI Customer Service (trs@pgroup.com), a reproducing example? Also, please let us know what platform you're using, the compiler version, and what flags you're compiling with.

  • Mat

Hi Mat,

Thanks for your very elaborate and precise answer! It helped out a lot!

As far as my understanding of the nested loops goes, I now understand the 3 options that you presented very well. I believe that loops ranging from 200 to 500 iterations (say X=300, Y=200, Z=100) can be called reasonably small, so I believe the collapse clause is the best choice. I also tried the other solutions, but the end result is more or less the same time-wise.

You didn't answer what happens to private variables, or whether they have to be private, but my assumption is that scalars used inside kernels are privatized for each thread automatically. I know that reductions are done automatically at the end if needed.

As far as my problem with loop bounds goes, I realized that my counters as well as my loop bounds were of type long. I changed them to int and now everything works well!

Also, the 100% processor usage was due to the fact that I was compiling with OpenMP flags in addition to the OpenACC flags. When I removed them, the processor usage went down and everything is fine now.
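Roughly, the difference in compile lines was the following (a sketch from memory; the file name is a placeholder):

% pgcc -acc -ta=radeon -Minfo=accel prog.c        # OpenACC only: the host stays mostly idle during kernels
% pgcc -acc -ta=radeon -mp -Minfo=accel prog.c    # -mp also enables the OpenMP directives, so host threads spin up too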

Unfortunately, a few more important questions have arisen. Here they are:

  1. I used the restrict keyword and the -Msafeptr compiler flag as you suggested; however, in one other function I have loops that are not one after another. They basically go like this:


   #pragma acc parallel loop collapse(2) private(a, array[0:Nx-1])
   for(cntj = 0; cntj < Ny; cntj ++) {
      for(cntk = 0; cntk < Nz; cntk ++) {
	     array[Nx - 2] = array1[Nx - 1][cntj][cntk];
	     for (cnti = Nx - 2; cnti > 0; cnti --) {
	        a = - const1 * array1[cnti + 1][cntj][cntk] + const12 * array1[cnti][cntj][cntk] - const3 * array1[cnti - 1][cntj][cntk];
	        array[cnti - 1] =  array2[cnti] * (const3 * array[cnti] - a);
	     }
	     array1[0][cntj][cntk] = 0.;
	     for (cnti = 0; cnti < Nx - 2; cnti ++) {
	        array1[cnti + 1][cntj][cntk] = array3[cnti] * array1[cnti][cntj][cntk] + array2[cnti];
	     }
	     array1[Nx - 1][cntj][cntk] = 0.;
      }
   }

As you can see, I have put the first two loops into a collapse clause, which is fine I guess. However, the two inner loops are unorthodox in the sense that there is first a line of code before each loop and then the loop (the 3 loops are not one after another). Also, as you can see, the 2 inner loops are heavily data dependent on each other in terms of the Nx dimension and the cnti counter, so obviously no parallelization can be introduced there. The compiler recognizes that very well, and even the restrict keyword and the compiler flag don't help, as compilation always warns about the data dependence, which in this case is very real! However, as you can see, I didn't parallelize those loops, so I would like to somehow remove that compiler warning, but neither the restrict keyword nor -Msafeptr has helped. What does help is putting a loop independent clause on those loops, but that is obviously wrong and produces a bad result in the end. Also, loop seq doesn't remove the warnings.

For the end result it doesn't matter, as the results of the calculation are correct when the 2 inner loops are not parallelized. I would just like to get rid of the unnecessary warnings.

  2. The second and more serious issue I encounter is that calling this function, say, 3 times per iteration and adding my previous function call (4 function calls per iteration), with, say, 1000 such iterations, produces a program speed that is very similar to what I'm getting with the OpenMP variant using 6 threads on the 6 cores of an AMD FX6300.

On the other hand, if I test only the first function given in the previous post (collapse(3) with no loop dependencies), I get about twice better time in OpenACC than in OpenMP (OK but not impressive). However, when I introduce these functions with loop dependencies, the results are almost the same as in OpenMP. I also thought that 4 function calls instead of 1 could introduce that decrease in speed (function call overhead, kernel launch overhead, etc.).

Do you have any clue why this is happening, or whether this is expected behavior in this case (loop dependencies)? Maybe a redesign of the previous code could yield a better speedup? Keep in mind that the original OpenMP code had threading only on the outermost loop, while the 2 inner loops were done sequentially in each thread.

This is kind of discouraging, as I was expecting to get much better results in the OpenACC case. Of course, it probably has to do with the loop dependencies, but still... I was testing OpenACC on my computer with a simple Laplace heat equation program. It showed a speedup of about 5-fold when regular 2-level nested loops were used: I was getting 10.5 seconds in OpenMP and 2.1 seconds in OpenACC. Although that program didn't have explicit loop dependencies. And now I'm getting the same execution times for my program between OpenMP and OpenACC...

You didn't answer what happens to private variables, or whether they have to be private, but my assumption is that scalars used inside kernels are privatized for each thread automatically. I know that reductions are done automatically at the end if needed.

Sorry, missed this one. Scalars are private by default and are declared local to the kernel, thus allowing them to be put into registers. Adding a variable to a "private" clause will create an array of the variable (one copy per thread) and place it in global memory. Hence, it's better not to put scalars in a private clause. However, in some cases, such as when a scalar is passed by reference to a subroutine, the scalar does need to be put in the private clause.
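For example, something like this sketch (the routine name "scale_into" and the surrounding code are just an illustration, not from your program):

/* Hypothetical device routine that writes through a pointer. */
#pragma acc routine seq
void scale_into(double *t, double v) { *t = 2.0 * v; }

void caller(double * restrict in, double * restrict out, int n, double dt)
{
   double tmp;
   /* "tmp" is passed by reference below, so it must be in the private
      clause to give each thread its own copy. */
   #pragma acc parallel loop private(tmp)
   for (int i = 0; i < n; i++) {
      scale_into(&tmp, dt * in[i]);
      out[i] = tmp;
   }
}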

As far as my problem with loop bounds goes, I realized that my counters as well as my loop bounds were of type long. I changed them to int and now everything works well!

Using long shouldn't matter, and long is required for large array support, so something else is going on. Can you please send a reproducer to PGI Customer Service (trs@pgroup.com)?


For the loop in #1, there are a couple of issues.

First, the schedule. Ideally you want a gang, or gang/worker, schedule for the two outer loops and a vector schedule for each of the inner loops. However, because the inner loop index corresponds to the first dimension, scheduling that loop as vector will give poor data access since it's not the stride-1 dimension.

Your choices are:

Only accelerate the outer loops, ensuring the "Nz" loop uses a vector schedule:

   #pragma acc parallel loop collapse(2) gang vector private(array[0:Nx-1])
   for(cntj = 0; cntj < Ny; cntj ++) { 
      for(cntk = 0; cntk < Nz; cntk ++) {

or

   #pragma acc parallel loop gang 
   for(cntj = 0; cntj < Ny; cntj ++) { 
   #pragma acc loop vector
      for(cntk = 0; cntk < Nz; cntk ++) {

However, this will limit the amount of parallelism, especially given these are your smallest loops. Also, the private "array" will have a poor data access pattern.

Another choice is to schedule the outer loops as gangs and the inner loops as vectors.

    #pragma acc parallel loop collapse(2) gang private(array[0:Nx-1]) 
   for(cntj = 0; cntj < Ny; cntj ++) { 
      for(cntk = 0; cntk < Nz; cntk ++) { 
        array[Nx - 2] = array1[Nx - 1][cntj][cntk]; 
    #pragma acc loop vector
        for (cnti = Nx - 2; cnti > 0; cnti --) { 
           a = - const1 * array1[cnti + 1][cntj][cntk] + const12 * array1[cnti][cntj][cntk] - const3 * array1[cnti - 1][cntj][cntk]; 
           array[cnti - 1] =  array2[cnti] * (const3 * array[cnti] - a); 
        } 
        array1[0][cntj][cntk] = 0.; 
        #pragma acc loop vector
        for (cnti = 0; cnti < Nx - 2; cnti ++) { 
           array1[cnti + 1][cntj][cntk] = array3[cnti] * array1[cnti][cntj][cntk] + array2[cnti]; 
        } 
        array1[Nx - 1][cntj][cntk] = 0.; 
      } 
   }

Here, the parallelism is increased at the cost of poor data access for "array1". However, the private "array" will only be created once per gang instead of once per thread, thus providing better data access.

Note, manually privatizing “array” may help here since you can then arrange the dimension to match stride-1 access:

    
double *** array;
...
#pragma acc parallel loop collapse(2) gang create(array[0:Ny][0:Nz][0:Nx-1])
   for(cntj = 0; cntj < Ny; cntj ++) { 
      for(cntk = 0; cntk < Nz; cntk ++) { 
        array[cntj][cntk][Nx - 2] = array1[Nx - 1][cntj][cntk]; 
    #pragma acc loop vector
        for (cnti = Nx - 2; cnti > 0; cnti --) { 
           a = - const1 * array1[cnti + 1][cntj][cntk] + const12 * array1[cnti][cntj][cntk] - const3 * array1[cnti - 1][cntj][cntk]; 
           array[cntj][cntk][cnti - 1] =  array2[cnti] * (const3 * array[cntj][cntk][cnti] - a); 
... etc.

If you were using an NVIDIA GPU with compute capability 3.5 or greater, you could decorate "array2" with "const" and "restrict", and this memory would then be placed in texture memory. Texture memory is read-only and acts as a very fast, large cache.
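As a sketch of what that would look like (function and variable names are mine, and again this only applies to NVIDIA targets with compute capability 3.5 or greater):

/* "const" plus "restrict" on a read-only input lets the compiler route
   its loads through the read-only (texture) cache. */
void scale(const double * restrict coeff, double * restrict out, int n)
{
   #pragma acc parallel loop
   for (int i = 0; i < n; i++)
      out[i] = 2.0 * coeff[i];
}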

Maybe a redesign of the previous code could yield a better speedup?

The loop structure and algorithm are probably fine, though changing the data layout of the arrays would help, i.e. change the arrays so they are indexed as "array[Ny][Nz][Nx]".


  • Mat

Hey Mat,

Thanks for the quick response, as usual! I tried the proposed solutions immediately. The first solution with:

   
#pragma acc parallel loop collapse(2) gang vector private(array[0:Nx-1])

gave correct results, albeit slow execution. The execution time is actually the same if I don't put gang vector in the directive, so it seems the compiler does that by itself without intervention.

Now the second proposed solution leaves me puzzled:

    #pragma acc parallel loop collapse(2) gang private(array[0:Nx-1]) 
   for(cntj = 0; cntj < Ny; cntj ++) { 
      for(cntk = 0; cntk < Nz; cntk ++) { 
        array[Nx - 2] = array1[Nx - 1][cntj][cntk]; 
    #pragma acc loop vector 
        for (cnti = Nx - 2; cnti > 0; cnti --) { 
           a = - const1 * array1[cnti + 1][cntj][cntk] + const12 * array1[cnti][cntj][cntk] - const3 * array1[cnti - 1][cntj][cntk]; 
           array[cnti - 1] =  array2[cnti] * (const3 * array[cnti] - a); 
        } 
        array1[0][cntj][cntk] = 0.; 
        #pragma acc loop vector 
        for (cnti = 0; cnti < Nx - 2; cnti ++) { 
           array1[cnti + 1][cntj][cntk] = array3[cnti] * array1[cnti][cntj][cntk] + array2[cnti]; 
        } 
        array1[Nx - 1][cntj][cntk] = 0.; 
      } 
   }

The problem here is that, as I stated in my previous post, the vectorized loops have data dependencies in them:

  
array[cnti - 1] =  array2[cnti] * (const3 * array[cnti] - a);

as well as:

array1[cnti + 1][cntj][cntk] = array3[cnti] * array1[cnti][cntj][cntk] + array2[cnti];

I do realize that it is up to me to solve this somehow, and I will. I just wanted you to know that this kind of parallelization is impossible with the current code. I mean, it runs faster (say without the vectorized loops it runs in 1m30s, and with the vectorized loops it runs in 50s), but the result is incorrect (usually it is nan), as there is a data dependency between the vector iterations.

Also, the proposed solution with array[0:Ny][0:Nz][0:Nx-1] created on the device, instead of a one-dimensional array, produces a nan, i.e. incorrect, result. The speed is also increased two-fold in that case, but to no avail.

The last proposed solution with const restrict is obviously not possible, as I have a Radeon card. Also, that array is global and initialized at runtime, so I'm not sure how it can be const at the same time...

Anyway, I won't bother you with any new questions unless something important arises. For me it is probably best to try to find out how to remove the data dependencies in those inner loops and vectorize them, because it seems that is the only way to really speed up this code, and everything else is just not significant enough.

Thanks again,

Bogdan

Hi Mat,

It's me again. I'm still battling that loop-carried dependency, and it is quite unknown whether I will be able to solve it altogether or whether it will have to stay like that. Lots of numerical algorithms (differentiation, integration) depend on those dependent loops. Are there any known techniques that help OpenACC parallelize them efficiently, or is OpenACC restricted to strictly regular loops, say a = a + const * b?

I have another dilemma. I'm trying to parallelize the following loop, but it seems pretty hard to do so:

   #pragma acc parallel loop collapse(3) private(B[0:Ny], C[0:Nz])
   for(cnti = 0; cnti < Nx; cnti ++) {
      for(cntj = 0; cntj < Ny; cntj ++) {
         for(cntk = 0; cntk < Nz; cntk ++) {
            C[cntk] = P[cnti][cntj][cntk] * P[cnti][cntj][cntk];
         }
         B[cntj] = funct(const, C, Nz);
      }
      A[cnti] = funct(const, B, Ny);
   }

This kind of pragma is not working. It reports the following errors:

509, Loop without integer trip count will be executed in sequential mode
Accelerator region ignored
510, Accelerator restriction: size of the GPU copy of ‘A’ is unknown
512, Accelerator restriction: size of the GPU copy of ‘P’ is unknown
515, Accelerator restriction: call to ‘funct’ with no acc routine information

The innermost loop is not problematic, as it is regular. However, the other two loops have function calls in them. These functions require the whole of C or B to be calculated before the next iteration of the enclosing loop can occur. Keep in mind that the function funct loops over the whole array C or B, so basically even that function could be parallelized, but I'm not sure how. I tried with #pragma acc routine worker, but it doesn't work.

I do realize that this is pretty much sequential code at its core, but are there any options for parallelization here, or is the scheduling for this so taxing that it is better off being done via OpenMP on the host?

Hi Bogdan,

Loops with backward dependencies are not parallelizable. This is true for OpenACC and OpenMP, as well as other parallel models. The problem is that one iteration of the loop needs to complete before the next, thus creating the dependency. If it were a forward dependency, i.e. "Arr[i] = Arr[i+1] * expr", then you could create a copy of the array to break the dependency: "Arr[i] = OldArr[i+1] * expr".
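For example, a sketch of that copy trick (array and variable names are hypothetical):

void forward_update(double * restrict arr, double * restrict old_arr,
                    int n, double expr)
{
   #pragma acc parallel loop
   for (int i = 0; i < n; i++)
      old_arr[i] = arr[i];             /* snapshot the array first */

   #pragma acc parallel loop
   for (int i = 0; i < n - 1; i++)
      arr[i] = old_arr[i + 1] * expr;  /* each iteration reads only the snapshot */
}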

Apologies for leading you astray earlier. I missed the dependency when just looking at the code.

As for the new issues, can you please post a reproducing example? I tried but failed to reproduce the errors you see. I suspect the loop trip count issue is due to the fact that Nz and Ny are being passed into "funct", but without seeing the declaration it's hard to tell.

For the unknown size of the “A” and “P” arrays, that’s because the compiler can’t tell the loop trip counts so doesn’t know the amount of data needed in the loop. You can always use the OpenACC data clauses to express how much data to copy over.

For “funct”, you need to either inline the function or add the OpenACC “routine” directive to funct’s definition and prototype in order to have the compiler create a device routine.
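A minimal sketch of what that looks like (the body here is a placeholder, not your actual funct; "seq" is simply the lowest level of parallelism):

#pragma acc routine seq
double funct(double h, double *a, int N);   /* prototype, e.g. in the header */

#pragma acc routine seq
double funct(double h, double *a, int N)    /* definition */
{
   double s = 0.0;
   for (int i = 0; i < N; i++)
      s += a[i];
   return s * h;
}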

  • Mat

Hi Mat,

Thanks for your reply. I will try out the proposed solutions and report back in a few days. I’m kind of in a hectic schedule right now.

Cheers,

Bogdan

Hey Mat,

Thanks for your previous answers. It is really hard for me to accept that forward and backward data-dependent loops cannot be parallelized. I have heard about the prefix sum algorithm and have read some articles about it, but it seems pretty complicated implementation-wise, and I unfortunately have no time to implement it myself, as I'm involved in multiple more important tasks right now. I'll leave it for some other occasion.

So I've finally found time to work on this code a bit more. Anyway, there are some interesting things going on with this code...

First of all, if I compile with the following:

   #pragma acc parallel loop collapse(3) private(B[0:Ny], C[0:Nz])
   for(cnti = 0; cnti < Nx; cnti ++) {
      for(cntj = 0; cntj < Ny; cntj ++) {
         for(cntk = 0; cntk < Nz; cntk ++) {
            C[cntk] = P[cnti][cntj][cntk] * P[cnti][cntj][cntk];
         }
         //B[cntj] = funct(const, C, Nz);
      }
      //A[cnti] = funct(const, B, Ny);
   }

and run this code, the following output is given (bear in mind that the function calls are commented out):

ACC: allocate size=12288000000 fails

Now, I don't really understand what this means.

If I go with plain kernels (the function calls are still commented out):

   #pragma acc kernels
   for(cnti = 0; cnti < Nx; cnti ++) {
      for(cntj = 0; cntj < Ny; cntj ++) {
         for(cntk = 0; cntk < Nz; cntk ++) {
            C[cntk] = P[cnti][cntj][cntk] * P[cnti][cntj][cntk];
         }
         //B[cntj] = funct(const, C, Nz);
      }
      //A[cnti] = funct(const, B, Ny);
   }

The code runs, but it is almost 2 times slower than if I put the following:

   #pragma acc parallel loop gang private(B[0:Ny], C[0:Nz])
   for(cnti = 0; cnti < Nx; cnti ++) {
      for(cntj = 0; cntj < Ny; cntj ++) {
         for(cntk = 0; cntk < Nz; cntk ++) {
            C[cntk] = P[cnti][cntj][cntk] * P[cnti][cntj][cntk];
         }
         //B[cntj] = funct(const, C, Nz);
      }
      //A[cnti] = funct(const, B, Ny);
   }

I don't get the difference between the previous two implementations of parallelism. One is plain kernels; the other is a plain acc loop, split supposedly among gangs at the Nx level only. Also, the B and C arrays should be gang-independent (am I doing that right with the private clause)? Are the second and third versions of the code doing the same thing? I think not.

Now to "funct". It basically does the following:

#pragma acc routine worker
double funct(double h, double *a, int N) {
   int cnti;
   double s, si, sj, sk;
   
   si = 0.; sj = 0.; sk = 0.;

   #pragma acc loop
   for(cnti = 1; cnti < N - 1; cnti += 2) {
      si += a[cnti];
      sj += a[cnti - 1];
      sk += a[cnti + 1];
   }
   
   s = sj + 4. * si + sk;
   if(N % 2 == 0) s += (5. * a[N - 1] + 8. * a[N - 2] - a[N - 3]) / 4.;

   return s * h / 3.;
}

As you can see, the function has a loop in it. I parallelized it, as the i-1, i and i+1 elements are only read from, never written to, so I believe there is no loop dependency here.

Also, I've put #pragma acc routine worker in front of the function, as well as before the function declaration in the .h file. The array a and its size should be known, as I've copied array a (array C at the call site) to the device prior to the whole first code block at the beginning of this post. I've also copied array B to the device prior to everything (A[cnti] = funct(const, B, Ny);). I don't understand why the compiler still cannot recognize the array length despite the array being copied beforehand.

So, despite putting #pragma acc routine worker, and despite copying the arrays prior to their usage in the same function, the compiler reports the following:

509, Loop without integer trip count will be executed in sequential mode
Accelerator region ignored
511, Accelerator restriction: size of the GPU copy of ‘C’ is unknown
512, Accelerator restriction: size of the GPU copy of ‘B’ is unknown
Accelerator restriction: size of the GPU copy of ‘P’ is unknown
515, Accelerator restriction: call to ‘funct’ with no acc routine information

That happens if I uncomment only the first function call. If I leave it commented out, everything compiles fine.

Any quick help is greatly appreciated. I have only a couple of days left of the OpenACC compiler trial, so I would like to solve this problem. Friday is the last day I can use the OpenACC compiler, I believe.

Thanks!

Bogdan

Hi Bogdan,

ACC: allocate size=12288000000 fails

When you privatize a variable on a vector loop, every thread gets its own copy of the variable. So by privatizing B and C you're using up too much memory on the device.

You can try something like the following so that only the gang and worker loops use the private variables, though depending upon the values of Ny and Nz it may still use too much memory.

// B is private to each gang, but shared by all workers/vectors in a gang   
   #pragma acc parallel loop gang private(B[0:Ny]) 
   for(cnti = 0; cnti < Nx; cnti ++) { 
// C is private to each worker, but shared all vectors within the worker
   #pragma acc loop worker private(C[0:Nz])
      for(cntj = 0; cntj < Ny; cntj ++) { 
    #pragma acc loop vector
      for(cntk = 0; cntk < Nz; cntk ++) { 
          C[cntk] = P[cnti][cntj][cntk] * P[cnti][cntj][cntk]; 
           } 
       //B[cntj] = funct(const, C, Nz); 
       } 
      //A[cnti] = funct(const, B, Ny); 
   }



The code runs, but it is almost 2 times slower than if I put the following:

In this code you’ve only parallelized the outer loop and only have 1 vector per gang. I wouldn’t expect this schedule to perform well.

I don’t get the difference between the previous two implementations of parallelism?

There is an article on the PGI website that gives a good explanation of the differences between "kernels" and "parallel".

As you can see, the function has a loop in it. I parallelized it, as the i-1, i and i+1 elements are only read from, never written to, so I believe there is no loop dependency here.

There is a dependency, but it's on the scalars. In these cases you need a reduction. However, we won't support loop reductions within "routines" until early next year. Also, this support will be restricted to NVIDIA devices with compute capability 3.5 or above.
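For reference, once the loop is inlined into a compute region (or when routine reductions become available), the scalar dependency would be expressed with a reduction clause; a sketch using your summation loop:

   /* Sketch: each thread gets partial sums which are combined at the end. */
   #pragma acc loop vector reduction(+:si,sj,sk)
   for(cnti = 1; cnti < N - 1; cnti += 2) {
      si += a[cnti];
      sj += a[cnti - 1];
      sk += a[cnti + 1];
   }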

To give advice on the rest, it would be very helpful to have a small example which illustrates the issues.

Friday is the last day I can use the OpenACC compiler, I believe.

You can always send a note to PGI Customer Service (trs@pgroup.com) and request that your trial license be extended.

  • Mat

Hi Mat,

I obviously haven't done my homework and haven't read the documentation carefully enough... I will read the article you suggested.

Other than that, it seems that I have little room to parallelize my algorithm altogether. I'll do what is possible.

As an end note, I have already extended my trial licence once. The guy from PGI customer service told me that he could extend it once again if I provide some code that shows an increase in OpenACC speed over other (say OpenMP) code. However, I have no such code other than the simple Laplace heat equation solver example given to me at an HPC school.

I do have my scientific code, but I have changed it so that it is not obvious what it is doing. I cannot infringe the copyright of that code, as it is the product of several scientific papers over a number of years. That is also the reason why I'm not posting exact reproducing examples on a public forum. Sorry for that, but I just cannot compromise our code.

Regards,

Bogdan

Hi Mat,

It’s me again :)

I’ve obtained a new trial license. Thanks to Dave for that!

Anyway, I have a quick question. Is there a way to tell the compiler to ignore loop-carried dependencies between arrays even though they exist? For example:

	     
for (cnti = Nx - 2; cnti > 0; cnti --) {
   c = - Ax * psi[cnti + 1][cntj][cntk] + Ax0r * psi[cnti][cntj][cntk] - Ax * psi[cnti - 1][cntj][cntk];
   cbeta_acc[cnti - 1] =  cgammax[cnti] * (Ax * cbeta_acc[cnti] - c);
}

This loop obviously has a dependence on the cbeta_acc array. However, I didn't parallelize this loop at all. I've left it unparallelized, but the compiler complains anyway, all the time. As we concluded earlier, this kind of loop cannot be parallelized (at least not easily). Thus I would like to get rid of all those warnings, as they pile up in my compile output.

I've put -Msafeptr in my compile flags, and put the restrict keyword on cbeta_acc, but I still get:

Loop carried dependence of ‘cbeta_acc->’ prevents parallelization
Loop carried backward dependence of ‘cbeta_acc->’ prevents vectorization

Any hints?

Thanks.

Hi timek28,

Anyway, I have a quick question. Is there a way to tell the compiler to ignore loop-carried dependencies between arrays even though they exist?

Yes, add the "independent" clause to the loop directive, or use the "parallel" directive instead of "kernels". "parallel" assumes you know that the loop is independent, while "kernels" does more analysis.

  • Mat

Hi Mat,

The independent clause won't cut it here. As I said, the loops ARE dependent, so independent will produce incorrect results. However, I now believe the dependency warnings should be there anyway, as there is a real dependency problem. I just wanted to eliminate them from the compile output, since I know there is a dependency and I'm not even trying to parallelize those loops. But I'll let them stay; that is not my main issue right now.

I'll bother you one more time, unfortunately. I'm trying to dig deeper into my code and understand why some things are faster on the GPU (OpenACC) than on the CPU (OpenMP). Here is the code snippet that I'm trying to dissect:

   #pragma acc kernels
   //#pragma acc parallel loop collapse(3) 
   for(cnti = 0; cnti < Nx; cnti ++) {
      for(cntj = 0; cntj < Ny; cntj ++) {
         for(cntk = 0; cntk < Nz; cntk ++) {
            psi2 = psi[cnti][cntj][cntk] * psi[cnti][cntj][cntk];
            psi2lin = psi2 * G;
            tmp = dt * (pot[cnti][cntj][cntk] + psi2lin);
            psi[cnti][cntj][cntk] *= exp(-tmp);
         }
      }
   }

You can see that I've posted this code in one of my first posts. I was wondering which pragma to use (kernels, collapse, gang, worker and vector). For reasonably small Nx, Ny and Nz, I have found that kernels or collapse are the best choices; the times are more or less the same. Gang, worker and vector give slower execution.

Anyway, my new dilemma is within the loop itself. These 3 loops are executed, say, 500 times. I get execution twice as fast with OpenACC as with OpenMP (6.5 sec vs. 13 sec). This is an OK result (I would like a better speedup, but heck, any speedup is good).

However, I don't completely understand the nuances of the speedup itself. Meaning, I don't always know which instruction gives OpenACC the edge over OpenMP. I tried commenting out different parts of the code to find out what produces the speedup, and what I found is that the computation itself is not where the difference lies. It is more about memory accesses. Funnily enough, the variables dt and G, as well as the array pot, are what create the difference time-wise. dt, G and pot are global variables that are only read. In OpenACC they are copied to the device prior to the loops, and I believe kernels gives each thread a private copy of the scalars, while the arrays are shared among the threads as usual. In the OpenMP version of the program, I believe scalars such as dt and G are shared between the threads. If I replace dt, G and pot with constants such as the number 1, the execution time is almost the same in both versions. However, when they are variables, the OpenACC execution is 2 times faster than OpenMP. Obviously, the edge is in the faster memory access only (GDDR5 vs DDR3).

Is there any document that gives an overview of where OpenACC excels over OpenMP? I want to achieve the maximum speedup possible, as I did with the heat equation. There I had the program run 5 times faster in OpenACC than in the OpenMP version. Here, however, this function executes only 2 times faster, while the 3 following functions are even more problematic (and possibly impossible to run faster).

Thanks in advance, and sorry for long posts.

Hi Mat,

I'm sorry for my previous question. It doesn't really have anything to do directly with PGI or OpenACC, so you can ignore it.

Anyway, I found out how to increase the speed of the program: by removing all function calls containing kernels altogether. Inlining the functions improved speed a lot. I also read that this is expected behavior and that inlining should be applied wherever possible. Is this true, and why are kernels inside looping function calls so taxing?

One more question at the end (I won't bother you anymore after this). I found strange timings for different versions of my code. For example:

#pragma acc parallel loop collapse(2) gang private(cbeta_acc[0:Nx])
for(cntj = 0; cntj < Ny; cntj ++) {
   for(cntk = 0; cntk < Nz; cntk ++) {
      cbeta_acc[Nx - 2] = psi[Nx - 1][cntj][cntk];
      //#pragma acc loop vector
      for (cntii = Nx - 2; cntii > 0; cntii --) {
         c = - Ax * psi[cntii + 1][cntj][cntk] + Ax0r * psi[cntii][cntj][cntk] - Ax * psi[cntii - 1][cntj][cntk];
         cbeta_acc[cntii - 1] =  cgammax[cntii] * (Ax * cbeta_acc[cntii] - c);
      }
      psi[0][cntj][cntk] = 0.;
      //#pragma acc loop vector
      for (cntii = 0; cntii < Nx - 2; cntii ++) {
         psi[cntii + 1][cntj][cntk] = calphax[cntii] * psi[cntii][cntj][cntk] + cbeta_acc[cntii];
      }
      psi[Nx - 1][cntj][cntk] = 0.;
   }
}

This gives 6.436s for 500 calls. The two outermost loops are Ny and Nz.

Then this version:

#pragma acc parallel loop collapse(2) gang private(cbeta_acc[0:Nx])
for(cntii = 0; cntii < Nx; cntii ++) {
   for(cntk = 0; cntk < Nz; cntk ++) {
      cbeta_acc[Ny - 2] = psi[cntii][Ny - 1][cntk];
      //#pragma acc loop vector
      for (cntj = Ny - 2; cntj > 0; cntj --) {
         c = - Ay * psi[cntii][cntj + 1][cntk] + Ay0r * psi[cntii][cntj][cntk] - Ay * psi[cntii][cntj - 1][cntk];
         cbeta_acc[cntj - 1] =  cgammay[cntj] * (Ay * cbeta_acc[cntj] - c);
      }
      psi[cntii][0][cntk] = 0.;
      //#pragma acc loop vector
      for (cntj = 0; cntj < Ny - 2; cntj ++) {
         psi[cntii][cntj + 1][cntk] = calphay[cntj] * psi[cntii][cntj][cntk] + cbeta_acc[cntj];
      }
      psi[cntii][Ny - 1][cntk] = 0.;
   }
}

This gives 6.655s for 500 calls, almost the same time as the previous one. The two outermost loops are Nx and Nz.

And lastly:

#pragma acc parallel loop collapse(2) gang private(cbeta_acc[0:Nx])
for(cntii = 0; cntii < Nx; cntii ++) {
   for(cntj = 0; cntj < Ny; cntj ++) {
      cbeta_acc[Nz - 2] = psi[cntii][cntj][Nz - 1];
      //#pragma acc loop vector
      for (cntk = Nz - 2; cntk > 0; cntk --) {
         c = - Az * psi[cntii][cntj][cntk + 1] + Az0r * psi[cntii][cntj][cntk] - Az * psi[cntii][cntj][cntk - 1];
         cbeta_acc[cntk - 1] =  cgammaz[cntk] * (Az * cbeta_acc[cntk] - c);
      }
      psi[cntii][cntj][0] = 0.;
      //#pragma acc loop vector
      for (cntk = 0; cntk < Nz - 2; cntk ++) {
         psi[cntii][cntj][cntk + 1] = calphaz[cntk] * psi[cntii][cntj][cntk] + cbeta_acc[cntk];
      }
      psi[cntii][cntj][Nz - 1] = 0.;
   }
}

This gives 12.324s for 500 calls - twice as long as the previous two! The two outermost loops are Nx and Ny.

I don't understand why this happens. Shouldn't the time be smallest when Nx and Ny are the two outer loops and Nz is the stride-1 dimension? Instead I get twice the time for this supposedly "perfect" arrangement of loops...

Also, keep in mind that only the two outer loops are parallelized, not the inner ones, due to the data dependency.

Thanks for everything.

Bogdan

Hi Bogdan,

Anyway, I found out how to increase the speed of the program: by removing all function calls containing kernels altogether. Inlining the functions improved speed a lot. I also read that this is expected behavior and that inlining should be applied wherever possible.

Inlining in general helps by reducing call overhead and can increase the opportunities for optimization.
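For reference, a sketch of a compile line that asks the compiler to do the inlining automatically (the file name is a placeholder; see the documentation for the -Minline sub-options):

% pgcc -acc -ta=radeon -Minfo=accel -Minline prog.c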

why are kernels inside looping function calls so taxing?

There's overhead to launching kernels. It's fairly small, but it can add up if you're calling them many times.

I don't understand why this happens. Shouldn't the time be smallest when Nx and Ny are the two outer loops and Nz is the stride-1 dimension? Instead I get twice the time for this supposedly "perfect" arrangement of loops...

It's because you're not vectorizing the stride-1 dimension (cntk), which causes memory divergence. All threads in a warp fetch data from memory at the same time. If that data is contiguous in memory, it can be brought in with one operation. If not, as is the case here, a separate memory transaction is needed for each thread.

For example, on the first iteration of the first “cntk” loop you have several accesses to psi. If we order the threads by which element of “psi” they access, we have something like:

thread 0: psi[0][0][Nz-2]
thread 1: psi[0][1][Nz-2]
...
thread 31: psi[0][31][Nz-2]

Given that threads 0 through 31 fetch data at the same time, and there are "Nz" elements between each access, it takes 32 fetches to global memory to get all the data. If the memory were contiguous, it would take one fetch and the remaining threads would get their data from the cache.
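As a generic sketch of the contrast (the arrays and function here are mine, not from your code): when the vector index is on the rightmost, stride-1 subscript, a warp's loads land on adjacent elements and coalesce into a few wide transactions.

void copy_scaled(int Ny, int Nz, double a[Ny][Nz], double b[Ny][Nz])
{
   #pragma acc parallel loop gang
   for (int j = 0; j < Ny; j++) {
      #pragma acc loop vector
      for (int k = 0; k < Nz; k++)
         b[j][k] = 2.0 * a[j][k];   /* adjacent k across threads -> coalesced access */
   }
}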

  • Mat

Hi Mat,

Thanks for the previous answer. Just one more question.

Considering that an OpenACC gang translates to a CUDA thread block, a worker translates to a warp, and a vector translates to a thread - what is the translation in the case of AMD GPUs? I mean in terms of wavefronts, CUs, PEs, or whatever processing elements AMD has.

Thanks in advance,

Bogdan

Hi Bogdan,

For NVIDIA targets, a gang does translate to the thread-block and a vector to a thread, but a worker translates to a thread as well. If both worker and vector are used together, then two dimensions of threads are used.

Similar with AMD, a gang translates to a work group, while worker and vector translate to a work item. Again when used together, a worker and vector create work items in a two dimensional work group.

  • Mat