Error if "private" not on same line as "parallel loop"

When I used this statement layout, it ran OK on the GPU:

!$acc parallel loop collapse(3) private(E1,E2,F1,F2,G1,G2)
!$acc& reduction(+:resid)

But when I put “private” on the next line, it crashed:

!$acc parallel loop collapse(3) 
!$acc& private(E1,E2,F1,F2,G1,G2)
!$acc& reduction(+:resid)

So why does this happen, please? Is my syntax wrong?

What’s the actual error? By “crashed” do you mean a compiler error or runtime crash?

You’re using fixed-format continuation (i.e. files with the “.f” extension), which would cause errors in free-format source (files with the “.f90” extension).

Free format continuation would look like:

!$acc parallel loop collapse(3) &
!$acc private(E1,E2,F1,F2,G1,G2) &
!$acc reduction(+:resid)

Hi Mat,

There was a segmentation fault at runtime, but it seems to have been a
spurious error which was not reproducible. So the above code (with !$acc&) does work fine now.

I am compiling on my laptop and then copying the executable to a cloud platform, where I am able to use a GPU machine, so maybe there is some variability there, in terms of compatibility (of my executable) with whichever machine it’s running on. I don’t know.

Thanks for your reply though.

There was a segmentation fault at runtime, but it seems to have been a spurious error which was not reproducible.

Ok, though keep in mind that seg faults occur in host code, not device code. If it happens again, try using a debugger like gdb to determine where it’s coming from.

I am compiling on my laptop and then copying the executable to a cloud platform,

By default, the compiler will target the device on the system where the code is being compiled. If you’re not doing so already, be sure to set the compute capability flag if the cloud GPU is different from your laptop’s. You can also set multiple target devices so the binary is portable across them.

For example: “-gpu=cc70,cc80,cc90” will target Volta, Ampere, and Hopper architectures.
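
For reference, a full compile line targeting all three would look something like this (assuming nvfortran; the source and output names here are just placeholders):

nvfortran -acc -gpu=cc70,cc80,cc90 -Minfo=accel -o solver solver.f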

That’s useful to know. I have been using only -gpu=cc75 so far; I don’t know which architecture that corresponds to.

I have a related issue because there are several parallelised loops in my code… The first one runs fine normally, but if I remove the parallelisation (i.e. comment out the !$acc lines) it generates NaNs when it runs (but only on the GPU machine, not locally). So I can’t understand how removing the parallelisation of one loop would create any problems. I realise this is a vague question, but what should I look for?

!$acc parallel loop collapse(1)
!$acc& reduction(max:maxvel)

      DO c = 1, ncutc

        i = cutc(c,1)
        j = cutc(c,2)
        k = cutc(c,3)

cc75 is the Turing architecture.

I can’t understand how removing the parallelisation of one loop would create any problems.

Most likely you’re not synchronizing the host and device memories. For example, if you update an array on the device and then use it on the host without adding an “update self” directive, the host will be using old data.
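
As a rough sketch of what I mean (the array name and loop bounds here are just placeholders):

!$acc data copy(u)
!$acc parallel loop collapse(3)
      do k = 1, nk
        do j = 1, nj
          do i = 1, ni
            u(i,j,k) = 2.0*u(i,j,k)   ! updated on the device
          enddo
        enddo
      enddo
!$acc update self(u)                  ! sync the device copy back to the host
      print *, u(1,1,1)               ! host code now sees the current values
!$acc end data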

Performance-wise you’ll want to move as much of the computation as possible to the GPU and limit data movement, but for debugging and development you might consider using CUDA Unified Memory (UM) by adding the flag “-gpu=managed”.

With UM, the CUDA driver will take care of the data movement for you. The caveat is that only allocated memory can currently be managed this way, so fixed-size arrays and scalars still need to be handled via the OpenACC data and update directives.
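
To illustrate that caveat (again just a sketch with made-up names), when compiling with “-gpu=managed”:

      real, allocatable :: q(:,:,:)   ! allocatable: migrated automatically by UM
      real :: coef(6)                 ! fixed size: still needs explicit data clauses

      allocate(q(ni,nj,nk))

!$acc data copyin(coef)
!$acc parallel loop collapse(3)
      do k = 1, nk
        do j = 1, nj
          do i = 1, ni
            q(i,j,k) = coef(1)*q(i,j,k)
          enddo
        enddo
      enddo
!$acc end data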

That’s interesting. I added this line before the loop which I was trying to run on the host, but it didn’t seem to make any difference to the NaN errors.

!$acc update self(r,u,v,w,p,t,m)

Ok, then I’ll need a minimal reproducing example in order to help further. Otherwise it’s guesswork.

-Mat

OK… I’m not sure how to create a minimal reproducing example…

I discovered that I can resolve the problem by moving the “data copy” statement so that the ‘host’ loop is not within the data copy region.
That means moving the data copy commands to within the main iteration loop, instead of outside it.

But that seems to slow things down a lot (unless that is because the host loop is no longer parallelised). I am not using any “present” clauses in the loop directives, and I’m not using any “update self” directives now either.

It still sounds like you’re missing an update clause someplace.

What I usually do in these cases, in particular when porting complex code or code I don’t know well, is to put an “update device” before every compute region and an “update self” after it. Performance will be poor due to the extra data movement, but it should ensure all data is synchronized. Then start to move the update directives outward, widening them until they merge with another update, and then delete them. This should be done incrementally, so that once the code starts to fail you’ve pinpointed which updates are necessary.

Of course I’m still just guessing, so if you need additional help I will really need at least some detailed code snippets, a minimal reproducing example, or even the full code if you’re able to share it.

Thanks - so do you have to list all the variables after both directives?
I will give that a try next and see how it goes.

!$acc update device(u,v,w,p,r,t,m)
<device loop>
!$acc update self(u,v,w,p,r,t,m)

I have some scalar variables in my first collapse(3) loop (which I assumed were “private” by default), but when I declared them as private, I got a different result. There was no mention of them in the -Minfo output, so now I’m confused as to why they would (or might) have been treated as “shared” by default. There are several variables which are accumulated in a loop nested within the main loops, like this: psum=psum+pp(i,j,k)… usum=usum+uu(i,j,k)… etc.

Again, having at least a code snippet to see what you’re doing in context would be helpful, but from what you show, psum and usum appear to be reduction variables, which have a different set of rules.

You’ll want to use “reduction(+:psum,usum)” instead of putting them in a private clause.
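
As a sketch of the general pattern, for the case where you need the totals after the loop finishes (loop bounds here are placeholders):

      psum = 0.0
      usum = 0.0
!$acc parallel loop collapse(3)
!$acc& reduction(+:psum,usum)
      do k = 1, nk
        do j = 1, nj
          do i = 1, ni
            psum = psum + pp(i,j,k)
            usum = usum + uu(i,j,k)
          enddo
        enddo
      enddo
C-----psum and usum now hold the full sums on the host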

OK, so here is a code snippet. The summed variables are only summed over the 6 neighbour cells (nei = 1,6), which is inside the cut-cell loop. I didn’t realise I needed to specify them as a reduction - that is interesting.

C=====
C11111 cut-cells
C=====
!$acc parallel loop collapse(1)
!$acc& reduction(max:maxvel)

      DO c = 1, ncutc

        i = cutc(c,1)
        j = cutc(c,2)
        k = cutc(c,3)

C-------original cut-cell method (6 neighbours)
        do nei = 1,6

          ii = i
          jj = j
          kk = k

          if (nei.eq.1) ii = i-1 
          if (nei.eq.2) ii = i+1 
          if (nei.eq.3) jj = j-1 
          if (nei.eq.4) jj = j+1 
          if (nei.eq.5) kk = k-1 
          if (nei.eq.6) kk = k+1 

          if ( cell_lcd(ii,jj,kk).EQ.1 ) then ! average from live cells
            ccln = ccln + 1
            rsum = rsum + r(ii,jj,kk)
            psum = psum + p(ii,jj,kk)
            tsum = tsum + t(ii,jj,kk)
            usum = usum + u(ii,jj,kk)
            vsum = vsum + v(ii,jj,kk)
            wsum = wsum + w(ii,jj,kk)
            msum = msum + m(ii,jj,kk)
          endif

        enddo

      END DO ! cutcell loop

Ok, in this case you don’t want to use a reduction clause for these variables. Reduction is for when you need to do the sum and have the result available after the end of the parallel loop. Here the summation stays within the body of the loop.

Why you get differing results when explicitly adding them to a private clause is not clear. Can you post the compiler feedback messages (i.e. -Minfo=accel) for this loop with and without using “private”? That might give some clues.

          r(i,j,k) = rsum/real(ccln)  !  set dens at cut-cell
          p(i,j,k) = psum/real(ccln)  !  set pres at cut-cell
          t(i,j,k) = tsum/real(ccln)  !  set temp at cut-cell
          m(i,j,k) = msum/real(ccln)  !  set visc at cut-cell

For this bit, given i, j, and k’s values are taken from a look-up table, do you know if all <i,j,k> indices are unique and you have no collisions? If you do have collisions, the order in which the arrays are updated is non-deterministic and could cause different results.

-Mat

Hi Mat, I put this on the same thread as before, although it is more of a general speedup/optimisation question…

I have got a speedup of about 50 between -acc=host (serial) and -acc -gpu=cc75 (GPU parallel). I have tried various things, but it seems to work best when I use the -O3 optimisation flag. I have ordered the IJK loop in reverse (KJI). I have tried various options that you suggested before in terms of loop gang, worker, vector, independent, etc. It seems to work best with the standard

!$acc parallel loop collapse(3)

So my question is: what else can I do to improve the performance of my Fortran code? I have tried to include as much as possible in one single IJK loop, to minimise communication, which seems to work OK now.

If you have an email address I could send a copy of the main solver loop for you to have a look at; I’d rather not post it on the forum here yet.

Probably your next step is to use Nsight Compute to do a low-level hardware profile of the kernel.

This will show how effectively the kernel is utilizing the GPU and if/where there may be bottlenecks. Look at the SOL (speed-of-light) % to know if the code is compute or memory bound.
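
If it helps, the profile can be collected on the GPU machine with something along the lines of (the executable name is just a placeholder):

ncu --set full -o solver_profile ./solver

and the resulting report opened afterwards in the Nsight Compute GUI.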

I’ll often first look at the difference between the theoretical occupancy and the achieved occupancy. If there is a large difference, it usually means that the warps are stalled or there’s not enough work. Warp stalls are typically due to waiting for memory, which means you may be able to improve the memory access pattern (i.e. you want the threads, aka vectors, to access data across the stride-1 dimension of the array). Though warps can also stall on other units, such as the FPU.
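
As a sketch of what I mean by stride-1 access (array names and bounds are placeholders): Fortran arrays are column-major, so the leftmost index is the contiguous one and the vector loop should run over it.

!$acc parallel loop gang collapse(2)
      do k = 1, nk
        do j = 1, nj
!$acc loop vector
          do i = 1, ni
            a(i,j,k) = b(i,j,k) + c(i,j,k)   ! adjacent threads touch adjacent "i" elements
          enddo
        enddo
      enddo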

I can’t really give a full tutorial via the forum, but if you have specific questions, I’ll do my best to help.

If you have an email address I could send a copy of the main solver loop for you to have a look at; I’d rather not post it on the forum here yet.

Understood. You can direct message me by clicking on my name and selecting the “message” button.

OK, so I first tried using ncu on my laptop (which has no GPU), and I suppose unsurprisingly, ncu finished by reporting…

==PROF== Target process 14915 terminated before first instrumented API call.
==WARNING== No kernels were profiled.

So, does that mean that I have to run ncu (remotely) on the cloud platform where I run the solver on GPUs? …or is there any other way to profile without actually using GPUs?

…or is there any other way to profile without actually using GPUs?

The command-line interface, ncu, profiles the low-level hardware counters, so it does need access to a GPU.

However, the Nsight Compute GUI does have the ability to run ncu on a remote system.

I personally only use ncu, so I will need to point you to the documentation for details: Nsight Compute :: Nsight Compute Documentation