OpenACC: Best way to parallelize nested DO loops (continued)

I'm trying to get some speedup using a GPU by compiling my demo code (below):

pgf77 -static-nvidia -acc -gpu=cc75 -o laplace_tesla laplace05.for

…since I'm running it on a Tesla T4.

I have tried all sorts of !$acc commands but nothing seems to give any GPU speedup. The serial demo code runs in 30 sec, and in parallel (multicore mode) it shows good speedup, much the same as OpenMP, so with ACC_NUM_CORES=2 it's twice as fast. But in GPU mode it's taking about 18 sec, so a bit faster than serial, but slower than multicore. So can anyone suggest how to optimise the ACC commands, or do I need to learn CUDA? (Also note that imm=jmm=kmm=200, so it's effectively an 8-million-cell grid.)

      onesixth = 1.0/6.0 

      do it = 1, nits

!$acc parallel loop collapse(3)
        do k = 2, kmm
        do j = 2, jmm
        do i = 2, imm
          phi(i,j,k) = ( phi(i-1,j,k) + phi(i+1,j,k)        
     &                 + phi(i,j-1,k) + phi(i,j+1,k)        
     &                 + phi(i,j,k-1) + phi(i,j,k+1) )*onesixth
        enddo
        enddo
        enddo

      enddo

How are you handling the data movement?

I'm not seeing any data directives, so you'd be relying on the compiler to implicitly copy the data. However, the implicit copies happen just before and after each kernel launch, meaning you'd be copying data back and forth for every iteration of the "it" loop.

If you run the code through the Nsight-Systems profiler, or set the environment variable “NV_ACC_TIME=1”, this will show you how much time is being spent copying data as well as the kernel execution time.
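For example, on the machine where you run the executable (assuming the tools are on your PATH), either:

export NV_ACC_TIME=1
./laplace_tesla

or:

nsys profile ./laplace_tesla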

A basic example for using data directives would be something like:

      onesixth = 1.0/6.0

!$acc data copy(phi(:,:,:))
      do it = 1, nits

!$acc parallel loop collapse(3) present(phi)
        do k = 2, kmm
        do j = 2, jmm
        do i = 2, imm
          phi(i,j,k) = ( phi(i-1,j,k) + phi(i+1,j,k)
     &                 + phi(i,j-1,k) + phi(i,j+1,k)
     &                 + phi(i,j,k-1) + phi(i,j,k+1) )*onesixth
        enddo
        enddo
        enddo

      enddo
!$acc end data

Hi Mat, thank you for your quick reply…

When I set export NV_ACC_TIME=1 on the remote machine, it comes back with "profile library libaccprof.so not found: libaccprof.so: cannot open shared object file: No such file or directory", maybe because I'm compiling on my local machine with -static-nvidia. Is there a way to include the libaccprof.so file in with the static compilation?

Now when I run the code with your data copy included, it runs about 30x faster
(0.9 sec, compared with 30 sec serial). However, the monitor of phi (in the middle of the grid) shows zero during the iterations, but shows a non-zero value when it's finished.

The final value of phi (GPU mode) is now 3.6218948E-06, compared with 7.6612935E-04 (serial mode). For context, the value of phi starts at zero and then tends towards 1. So I'm wondering if the computation is correct, because the final value should be roughly the same as the serial one, I think. Thanks, Giles

PS. I ran it for 10,000 iterations (took 6 sec) - and the solution looks good

Hi Mat, I uploaded the libaccprof.so file to the remote HPC and then set LD_LIBRARY_PATH=. and it seems to have given me some profile data now (below), but there is another warning about libcupti.so not found. So what do you make of the data below? Thanks, Giles

Hi Giles,

Looking at your code, this is probably expected. Notice the dependencies?

Since you have phi on both the left and right hand sides, and the right-hand side uses values from adjacent cells, the computation will be different depending on the order in which the loop iterations are performed. When running in parallel, the execution order is indeterminate.

For the forward dependencies, i.e. "i+1", "j+1", "k+1", you can read from a copy of "phi", e.g. "phiOld". The backward dependencies are more problematic, though, since they depend on the previous iteration being completed before the next one can be performed.

Basically, you'll need to rethink your algorithm to remove the dependencies in order to parallelize the code.
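For what it's worth, here's a minimal sketch of one way to do that (not your original algorithm): compute into a scratch array, called "phiNew" here, then copy it back, so every read comes from the previous iterate. Note this effectively turns the in-place sweep into a Jacobi-style iteration, so the convergence behaviour will differ from the serial version:

!$acc data copy(phi) create(phiNew)
      do it = 1, nits

C=====compute the new iterate purely from the old one (no dependencies)
!$acc parallel loop collapse(3)
        do k = 2, kmm
        do j = 2, jmm
        do i = 2, imm
          phiNew(i,j,k) = ( phi(i-1,j,k) + phi(i+1,j,k)
     &                    + phi(i,j-1,k) + phi(i,j+1,k)
     &                    + phi(i,j,k-1) + phi(i,j,k+1) )*onesixth
        enddo
        enddo
        enddo

C=====copy the interior back; the boundary values are never modified
!$acc parallel loop collapse(3)
        do k = 2, kmm
        do j = 2, jmm
        do i = 2, imm
          phi(i,j,k) = phiNew(i,j,k)
        enddo
        enddo
        enddo

      enddo
!$acc end data

Here "phiNew" is assumed to be declared with the same shape as "phi". Since it only appears in a "create" clause, it lives on the device and adds no extra host-device traffic.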

Glad you were able to solve the performance problem. If you remove the data region and then rerun the profile, you'll likely see all the time is spent in data movement.

-Mat

Cool - thanks very much - that all makes sense.

So is there any way to improve the speedup further,
or is 30x about as good as you can expect going from serial to GPU?
Do I need to use async or gangs or some other commands?

It would be dependent on the code. I’ve seen code get 100x, 200x, and even an outlier at 1000x.

The question is whether 30x is the best you can get from this code, but that's not something I can answer. For that, run the code through Nsight Compute and see how close it's getting to the SOL (Speed of Light), i.e. how close it is to achieving peak performance of the device.
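For reference, assuming the Nsight Compute CLI ("ncu") is available on the remote machine, a profiling run would look something like:

ncu -o laplace_profile ./laplace_tesla

and the resulting report shows the SOL section per kernel (the output file name here is just an example).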

Do I need to use async or gangs or some other commands?

“async” wouldn’t improve the kernel time, but it can help a bit with the wall clock time by hiding the launch overhead. As one kernel is executing, the CPU thread can launch the next kernel, effectively hiding the cost.
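As a sketch of where the clauses would go, reusing the data-region example from above purely to show placement: each kernel is queued on async queue 1 and the host only blocks at the "wait" after the iteration loop. Kernels on the same queue still execute in order on the device, so each "it" iteration waits for the previous one.

!$acc data copy(phi(:,:,:))
      do it = 1, nits

C=====queue the kernel; the host thread does not block here
!$acc parallel loop collapse(3) present(phi) async(1)
        do k = 2, kmm
        do j = 2, jmm
        do i = 2, imm
          phi(i,j,k) = ( phi(i-1,j,k) + phi(i+1,j,k)
     &                 + phi(i,j-1,k) + phi(i,j+1,k)
     &                 + phi(i,j,k-1) + phi(i,j,k+1) )*onesixth
        enddo
        enddo
        enddo

      enddo
C=====block the host until all work on queue 1 has finished
!$acc wait(1)
!$acc end data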

You can try other schedules to see if things improve, but I would expect collapse(3) to be optimal here. Though it's easy to experiment. Just a few examples of schedules to try:

  1. Collapse the outer two loops and schedule across the gangs (CUDA Blocks). Distribute the inner loop across the vectors (CUDA thread x-dimension) within each gang
!$acc parallel loop gang collapse(2) present(phi)
        do k = 2, kmm
        do j = 2, jmm
!$acc loop vector
        do i = 2, imm
  2. Gang outer, collapse the inner vector loops
!$acc parallel loop gang present(phi)
        do k = 2, kmm
!$acc loop vector collapse(2)
        do j = 2, jmm
        do i = 2, imm
  3. Gang outer, use workers (CUDA thread y-dimension) on the middle loop, vector inner
!$acc parallel loop gang present(phi)
        do k = 2, kmm
!$acc loop worker
        do j = 2, jmm
!$acc loop vector
        do i = 2, imm
  4. Tile the inner loops. Similar to #3, but with specific sizes for the x and y thread dimensions
!$acc parallel loop gang present(phi)
        do k = 2, kmm
!$acc loop tile(16,16)    
        do j = 2, jmm
        do i = 2, imm
  5. Let the compiler decide:
!$acc kernels loop independent present(phi)
        do k = 2, kmm
!$acc loop independent
        do j = 2, jmm
!$acc loop independent
        do i = 2, imm

Likely it would do a Gang, Gang-Vector, Vector schedule, but to see what it actually does, add the flag "-Minfo=accel" and the chosen schedule will be shown in the compiler feedback messages.

There are other permutations to try, but hopefully this gives you some ideas.

-Mat

Oh, I should mention you can combine gang, worker, and vector on the same loop. For example:

!$acc parallel loop gang present(phi)
        do k = 2, kmm
!$acc loop worker vector collapse(2)
        do j = 2, jmm
        do i = 2, imm

This collapses the inner loops and then creates a strip-mine loop (i.e. a loop sized to the vector length).

There’s also the “vector_length”, “num_workers”, and “num_gangs” clauses if you want to override the default sizes.
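For example (the sizes here are purely illustrative, not tuned for the T4), the combined schedule above could be given explicit sizes, inside the same data region as before:

!$acc parallel loop gang num_workers(4) vector_length(128)
        do k = 2, kmm
!$acc loop worker vector collapse(2)
        do j = 2, jmm
        do i = 2, imm
          phi(i,j,k) = ( phi(i-1,j,k) + phi(i+1,j,k)
     &                 + phi(i,j-1,k) + phi(i,j+1,k)
     &                 + phi(i,j,k-1) + phi(i,j,k+1) )*onesixth
        enddo
        enddo
        enddo

"num_gangs" works the same way on the "parallel" directive if you also want to fix the number of gangs.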

That's great - thanks for the additional suggestions
That gives me a few things to try - Best wishes, Giles

So can I check - if the data copy() statements enclose several do-loops, would you have to specify all the array names in the 1st data statement, and then specify the specific array names (where used) in the present() statements ?

!$acc data copy(phi1, phi2, phi3)

  !$acc parallel loop collapse(4) present(phi1)
      do i = 1,ni
      <some phi1 code>
      enddo
  !$acc end parallel

  !$acc parallel loop collapse(4) present(phi2)
      do i = 1,ni
      <some phi2 code>
      enddo
  !$acc end parallel

  !$acc parallel loop collapse(4) present(phi3)
      do i = 1,ni
      <some phi3 code>
      enddo
  !$acc end parallel

!$acc end data

The "present" clause is used to enforce at runtime that the data is actually present on the device. In this example, given that the compute regions are within a structured data region, it's not necessary; the compiler sees that they are in the same scope.

I just get in the habit of using it, since unstructured data regions (i.e. enter/exit data) are usually not within the same scope, hence I want to ensure I didn't miss anything.

if the data copy() statements enclose several do-loops, would you have to specify all the array names in the 1st data statement, and then specify the specific array names (where used) in the present() statements ?

Again, it’s optional here. But in general when you want to ensure data is present on the device, each array variable should be put in the “present” clause’s list.

Note you can use "default(present)", which applies "present" to all data in the compute region that would otherwise be shared by default. It does not apply to variables that are implicitly private, such as scalars.
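For example, each compute construct in your snippet could be written as:

!$acc parallel loop collapse(4) default(present)

and then the copy list on the outer data region is the only place the array names need to appear.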

Hi Mat,

Right, so now it's running OK in serial, but I'm getting NaNs when running on the GPU.

So I've got a data copy statement putting all the arrays in there…
!$acc data copy(obs,vel,fin,fout,feq,rho,vec,wt,col1,col2,col3,vin)

Then all of the loop statements are like this…
!$acc parallel loop collapse(3) default(present)

So it seems to run OK (without errors), but there are NaNs in the output data,
so I'm thinking there must be something wrong with the data management
(or whatever you call it). Have you got any ideas, please, where I should look?

Hard to say without an example, and these variables aren't in the code snippet you posted earlier, so I'm assuming this is a different code. Granted, the code above isn't parallelizable due to the dependencies.

You can try adding the flag “-gpu=managed” so all allocated data will be put in CUDA Unified Memory and the CUDA driver will handle the data movement. If this works, then the problem is more likely with your data directives. Unless the problem is with a static array or scalar.

Otherwise, you’ll want to look at the offloaded code. Do you still have dependencies? Do you need to privatize a variable? Do you need to atomically update some data?

If you can post a minimal reproducing example, I can take a look.

OK thanks - there are some good things to look into there…
I am already setting -gpu=cc75, so how should I add -gpu=managed as well, please?

Either “-gpu=cc75 -gpu=managed” or “-gpu=cc75,managed”
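For example, based on the compile line from your first post:

pgf77 -static-nvidia -acc -gpu=cc75,managed -o laplace_tesla laplace05.for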

OK thanks - I think it may need some private statements,
if it's anything like OpenMP that is. Thanks for your help.

if it's anything like OpenMP that is

In OpenMP, all variables are shared by default, except for parallel loop index variables.

In OpenACC, arrays are shared by default, but scalars are (mostly) firstprivate by default.

For full details, see Section 2.6 of the OpenACC standard.
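For illustration only (reusing the phi/phiNew arrays from the sketch above, inside a data region, with a small hypothetical scratch array "work" that is not in your code), an array that is written and then read within a single iteration needs to be listed in a "private" clause so each thread gets its own copy, whereas scalars like "onesixth" can be left alone:

C=====work is a small per-thread scratch array, e.g. real work(3)
!$acc parallel loop collapse(3) default(present) private(work)
        do k = 2, kmm
        do j = 2, jmm
        do i = 2, imm
C=====each thread writes to its own private copy of "work"
          work(1) = phi(i-1,j,k) + phi(i+1,j,k)
          work(2) = phi(i,j-1,k) + phi(i,j+1,k)
          work(3) = phi(i,j,k-1) + phi(i,j,k+1)
          phiNew(i,j,k) = ( work(1) + work(2) + work(3) )*onesixth
        enddo
        enddo
        enddo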

Hi Mat,
I had to rewrite the code to include all 8 sets of nested do-loops in one big one. Now having done that, the schedule which seems to work best is:

!$acc kernels loop independent present(phi)
        do k = 1, km
!$acc loop independent
        do j = 1, jm
!$acc loop independent
        do i = 1, im

And it runs fast, but the problem now is that NaNs are creeping into the solution after, say, 200 iterations (of the big nested do-loop), and it's variable: sometimes the solution is full of NaNs at 200 iterations and sometimes not. It's generally OK after 100 iterations, but at 200 iterations it's a bit hit-and-miss. Note that in serial and multicore mode there is no such problem.

So I'm wondering why there is this random behaviour in the solver. It's quite difficult to isolate the problem when it's variable. I'm assuming that the partitioning is slightly different each time I run the executable. So how do I fix the partitioning whilst still using loop independent, if that's possible?

The other thing that concerns me is that (in the streaming step) the code specifies values outside of its own domain. In other words, values on the neighbouring cells will be set at the edge of a block. So I'm wondering if that is causing a problem somehow (not sure). Regards, Giles.

Finally got something to run quickly on the GPU, but had to split the large nested do-loop into two, with the (last) streaming step in a separate nested do-loop at the end. It runs super fast compared with serial - I'm just doing the serial timing right now for comparison. But amazingly the only mods to the code are those shown below:

!$acc data copy(obs,vel,fin,fout,feq,rho,vec,wt,col1,col2,col3,vin)
C=====main iteration loop
      do it = 1, nits

!$acc kernels loop independent
      do k = 1,nk
!$acc loop independent
      do j = 1,nj
!$acc loop independent
      do i = 1,ni

C=====end main iteration loop
!$acc end data

Does this still occur in the new version?

I can’t really say why this would occur in your particular case without a reproducible example. Though things like numerical instability, uninitialized memory, or most likely a race condition could be at fault.

Finally got something to run quickly on the GPU, but had to split the large nested do-loop into two, with the (last) streaming step in a separate nested do-loop at the end. It runs super fast compared with serial - I'm just doing the serial timing right now for comparison. But amazingly the only mods to the code are those shown below:

This snippet is basically the same as what I posted earlier, so I'm not sure what you did, but glad it's getting the performance improvement you were expecting.

-Mat