Converting OpenMP from multicore to GPU question

Hello again all!

I got some good advice here before on getting started with the nvfortran compiler coming from ifort, and it's finally working. I've been working on converting some multicore OpenMP code to make use of the GPU instead, and I'm having trouble telling the code which variables I want reduced, as well as setting the scheduling (if that's possible). The original multicore code was:
!$OMP PARALLEL
!$OMP& DEFAULT(SHARED)
!$OMP& PRIVATE(I,DDOTA,DTEMP)
!$OMP& SCHEDULE(STATIC)
!$OMP& REDUCTION(+:CORR2,CORR1,SUB2CT,SUB1CT,DSUM,SCORE)

which worked fine.
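
For reference, a stripped-down stand-in for what that directive wraps looks roughly like this; the loop body and array are made up (the real loop is far bigger), and I've written the directive on one line as a PARALLEL DO just to show the shape of the construct:

Program cpu_toy
  implicit none
  integer, parameter :: N = 1024
  real :: A(N), DSUM, DTEMP
  integer :: I
  A = 1.0
  DSUM = 0.0

!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(I,DTEMP) SCHEDULE(STATIC) REDUCTION(+:DSUM)
  do I=1,N
    DTEMP = 2.0 * A(I)      ! stand-in for the real per-iteration work
    DSUM  = DSUM + DTEMP
  end do
!$OMP END PARALLEL DO
  print *, DSUM   ! expect 2048.0

end program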

Now I'm trying to get the GPU version to apply effectively the same restrictions, and the documentation was a big help in getting the code to execute at all. What I have that compiles (and doesn't throw a syntax error) is:

!$OMP Target parallel loop default(shared) private(I,DDOTA,DTEMP)

But when I try to add the schedule and reduction clauses, the compiler throws syntax errors. I'm pretty sure I'm doing something wrong, but however I try to crowbar the reduction or schedule clauses in, I get the error.

I think I need those clauses in there, since the result I'm getting from the overall program is off by quite a bit, and it reminds me of the results I got when playing around with the multicore code a long time ago (not "totally bonkers", but off by enough not to be just a rounding error).

Any help is appreciated! Happy to say I think we'll be ditching the Intel compiler at this point… strongly considering getting an AMD CPU now…

What’s the syntax error?

I'm guessing it might be due to the mixed syntax, but I'm not sure. Can you try using "teams" in place of "parallel"? For example:

% cat test.f90

Program test
  implicit none
  real, allocatable :: array(:)
  real :: res
  integer :: i
  allocate(array(1024))
  array = 1.0
  res=0.0

!$OMP target teams loop default(shared) reduction(+:res) map(to:array)
  do i=1,1024
    res = res + array(i)
  enddo
  print *, res

end program
% nvfortran -mp=gpu test.f90 -Minfo=mp; a.out
test:
     11, !$omp target teams loop
         11, Generating "nvkernel_MAIN__F1L11_1" GPU kernel
             Generating Tesla code
           12, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
               Generating reduction(+:res)
         11, Generating Multicore code
           12, Loop parallelized across threads
     11, Generating implicit map(tofrom:res)
         Generating map(to:array(:))
    1024.000

Alternatively, you can use "distribute parallel do", but we're recommending folks start with "loop" and move to "distribute" only if the algorithm can't be expressed with "loop". "loop" gives the compiler more freedom on scheduling, so it can give better performance portability and often better performance overall.

% cat test2.f90

Program test2
  implicit none
  real, allocatable :: array(:)
  real :: res
  integer :: i
  allocate(array(1024))
  array = 1.0
  res=0.0

!$OMP target teams distribute parallel do default(shared) reduction(+:res) map(to:array)
  do i=1,1024
    res = res + array(i)
  enddo
  print *, res

end program
% nvfortran -mp=gpu test2.f90 -Minfo=mp ; a.out
test2:
     11, !$omp target teams distribute parallel do
         11, Generating Tesla and Multicore code
             Generating "nvkernel_MAIN__F1L11_1" GPU kernel
     11, Generating implicit map(tofrom:res)
         Generating map(to:array(:))
    1024.000

Hope this helps,
Mat


[EDIT]: I found what my problem was; it was actually unrelated. But your advice really helped get this GPU execution into an optimal state and running properly. Thanks again for the help!

It is running, of course, but it's still returning a different answer from the CPU multicore code. Could it have to do with precision loss when the threads recombine their values? I recall the CPU had the same issue, a minor drift in the accuracy due to having to recombine values. It could just be a compiler flag issue, though. What fixed the CPU error issue was enabling "-fp-model precise", which resulted in answers that were close to the original single-core code. I don't know if nvfortran has a similar flag.

Currently, I’m doing:

nvfortran $srcnam -O2 -gpu=flushz -mp=gpu -Minfo=mp -o $unam -c -w

The short of what the code is actually doing is colossal dot products, if that matters. It effectively loops multiple dot products over and over, building on the previous results. That's the reason I'm asking about precision, since a minor loss of precision at the beginning cascades through the entire calculation. The results I get are not "awful", but they are significantly off from the "correct" value. It may just be the nature of the beast though.

A slightly different problem, but I appreciate any help all the same! Still just happy that I am getting it to run with the right OMP declarations… I imagine that is set correctly now.

Possible, but it's more likely due to rounding error and the order of operations. With parallel reductions, the order in which the operations are performed can change the accumulated rounding error, thus producing differing results. It's usually not vastly different, so if your results are only slightly different, this could be the cause. FMA (Fused Multiply-Add) operations can also contribute to differing values.
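
To illustrate the order-of-operations point with a toy serial example (this isn't your code, just a sketch): summing the exact same single-precision values forward and backward already gives slightly different answers, and a parallel reduction effectively picks yet another, nondeterministic order:

Program sum_order
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: a(:)
  real :: fwd, bwd
  integer :: i
  allocate(a(n))
  ! values spanning several magnitudes so rounding differences show up
  do i=1,n
    a(i) = 1.0 / real(i)
  end do
  fwd = 0.0
  do i=1,n            ! accumulate front to back
    fwd = fwd + a(i)
  end do
  bwd = 0.0
  do i=n,1,-1         ! accumulate back to front
    bwd = bwd + a(i)
  end do
  print *, 'forward :', fwd
  print *, 'backward:', bwd   ! typically differs from fwd in the last digits
end program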

Our equivalent to "-fp-model precise" would be "-Kieee", so you can try that, though it won't help with the order-of-operations issue.

You can also try disabling FMA (-Mnofma), but this typically only matters if the CPU version doesn't have FMA enabled. FMA is actually more precise, since the multiply and add are performed in a single operation and thus have less rounding error, but it can give divergent results when compared to results computed without FMA.
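
For example, adding both flags to the compile line you posted earlier would look something like this (untested, just your line with the two flags appended):

nvfortran $srcnam -O2 -gpu=flushz -mp=gpu -Minfo=mp -Kieee -Mnofma -o $unam -c -w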

If you're using single precision, you might try using double instead. If you're declaring your variables as just "REAL", you can change the default kind to REAL(8) by setting the flag "-r8". Though that applies universally, so using "REAL(8)" explicitly for just the summation variable may help as well.
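
For the summation-variable idea, here's a minimal sketch based on my earlier toy example (not your actual code): the data stays single precision and only the accumulator is REAL(8):

Program mixed_sum
  implicit none
  real, allocatable :: array(:)   ! data stays single precision
  real(8) :: res                  ! only the accumulator is double precision
  integer :: i
  allocate(array(1024))
  array = 1.0
  res = 0.0d0

!$OMP target teams loop reduction(+:res) map(to:array)
  do i=1,1024
    res = res + real(array(i), kind=8)
  enddo
  print *, res

end program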


Thanks for the suggestions.

I tried the -Kieee, and I noticed that I am still getting a "Warning: ieee_inexact is signaling" when the particular code executes, so I'm guessing there are still some precision problems in the code somewhere. I didn't see any difference when I tried disabling FMA either, and -r8 exploded the code haha (not unexpected, the code is very large with many other sections doing a number of tasks). So I'm not sure what else could be causing the problem besides the order of operations or some inherent precision loss.

I suppose a question I should ask: given that I'm effectively running the same set of parameters as I was with the Intel compiler on the CPU (and another thing for me to try is having nvfortran compile for multicore to see what that does), which would you expect to be more precise at the end of the day? There are some built-in values the overall math is "targeting", and I noticed that the GPU is almost dead on those values, which makes me wonder whether the "better" results we got from the CPU were just an artifact of the process. We generate thousands of results at a time, so it could be that the GPU results are actually "more correct" than the CPU predecessor's, but I'm not knowledgeable enough to tell whether the GPU is more likely to be correct than the CPU (assuming it is running correctly; it isn't behaving the way a "catastrophic failure" usually does). The math is dancing on a razor's edge, so blatantly "wrong" math normally produces nonsense.

Thanks again for helping me with this stuff.

At least at full optimization, I've always found us to be more precise (we try to stay within 1 ULP) than Intel. Though running massively parallel can change things due to the order-of-operations issue.

When you say that using -r8 "exploded" the code, do you mean memory usage, or that the results changed significantly? If the results changed, then this seems to imply some numerical instability in the code.

Not sure what the next steps would be, but maybe try targeting just the GPU section and use double precision there (i.e. explicitly use REAL(8))?

-Mat

That’s what I suspected. I suppose it may just be a tradeoff. Thanks for letting me bounce that off you.

With -r8 the code just wouldn't compile at all and threw a ton of errors in areas not related to this particular section. I don't think it is a problem in the code itself, just that the code is old (from the '70s) and "controlled" so tightly that it wouldn't surprise me if other areas have problems with double precision. I agree that it is probably ideal to just have the parallel segment work in double precision if I can do that.

I'll play around with a few more things to try to get that precision handled, but it may boil down to deciding whether we want higher speed or greater precision. Either way, thanks for your time and help! I learned quite a bit. I'll still follow this topic if you have another idea, but I'll quit pestering you on it haha!

Thanks again.
