Poor OpenMP performance compared to GCC gfortran

Hello,

I am currently trying to port our “gmq” molecular dynamics code from gfortran-4.4.5 + OpenMP to PGI + the Accelerator model.
While I succeeded in compiling with PGI 11.10, the OpenMP performance is not satisfactory.
The main subroutine is FORCE.f.

Here are the compile command lines (some “defines” removed):

gfortran scalar : gfortran -O2 -c -o FORCE.o2 FORCE.f
gfortran openMP: gfortran -O2 -fopenmp -c -o FORCE.o2 FORCE.f
PGI scalar: pgfortran -Minfo -fast -O2 -Mpreprocess -c -o FORCE.o2 FORCE.f
PGI openMP: pgfortran -Minfo -mp -Mpreprocess -c -o FORCE.o2 FORCE.f

As explained in the man pages, ‘-mp’ is also used at the link stage.

Fortunately, all 4 versions give the same results, but the timings differ:

gfortran-scalar : 4.06 s
pgi-scalar : 3.727 s (that’s fine)
gfortran-openMP : 2.93 s (on 4 CPUs, a 1.3x speedup; modest, but that’s not the point right now)
pgi-openMP : 11.453 s (x0.32 !!!???)

I am probably missing something, aren’t I?

Here is the code section (full source code can be found here: http://nicolas.charvin.free.fr/gmq/FORCE-prepro.f.gz)
(note: I only wish to port the code, and unfortunately, I still do not fully understand it :)


            SUMFX=0.0
            SUMFY=0.0
            SUMFZ=0.0

!============== NC : this is the inner-most intensive loop
!$omp parallel do schedule(auto)
!$omp+ reduction(+:ENVDW,VIRVDW,A11,A12,A13,A22,A23,A33)
!$omp+ reduction(+:REALST,SUMFX,SUMFY,SUMFZ)
!$omp+ private(K,J,JTYP,XD,YD,ZD,XDD,YDD,ZDD,RSQ,FR1,IJTYP)
!$omp+ private(R2,R6,R12,R,ALR,EX,T,EAL,QIQJ,REALSE)
!$omp+ private(FX1,FY1,FZ1)
!$omp+ shared(FX,FY,FZ)

            DO 190 K = LSTART,NLIST(II)
               J = LIST(K)
               JTYP = ITYPE(J)
               XD = XQ(I) - XQ(J)
               YD = YQ(I) - YQ(J)
               ZD = ZQ(I) - ZQ(J)
               XDD = XD - INT(XD)*TWO
               YDD = YD - INT(YD)*TWO
               ZDD = ZD - INT(ZD)*TWO
               XD = H(1,1)*XDD + H(1,2)*YDD + H(1,3)*ZDD
               YD = H(2,1)*XDD + H(2,2)*YDD + H(2,3)*ZDD
               ZD = H(3,1)*XDD + H(3,2)*YDD + H(3,3)*ZDD
               RSQ = XD*XD + YD*YD + ZD*ZD
               FR1 = 0.0
               IJTYP = INBTYP(ITYP,JTYP)
               IF (RSQ.LT.CUT2TY(IJTYP)) THEN
                  R2 = SIG2TY(IJTYP)/RSQ
                  R6 = R2*R2*R2
                  R12 = R6*R6
                  ENVDW = ENVDW+FEPSTY(IJTYP)* (R12-R6) + OFEPST(IJTYP)
                  FR1 = F2EPST(IJTYP)*R2* (R12-HALF*R6)

                  VIRVDW = VIRVDW - FR1*RSQ
               END IF
               IF (LOGQ) THEN
                  IF (RSQ.LT.CUTQ2 .AND. Q(J).NE.ZERO) THEN
                     R = SQRT(RSQ)
                     ALR = ALPHA*R
                     EX = EXP(-ALR**2)
                     T = ONE/ (ONE+P*ALR)
                     EAL = ((((A5*T+A4)*T+A3)*T+A2)*T+A1)*T*EX
                     QIQJ = Q(I)*Q(J)*E2DFPE
                     REALSE = EAL*QIQJ/R
                     REALST = REALST + REALSE
                     FR1 = FR1 + (REALSE+ALPI*QIQJ*EX)/RSQ
                  END IF
               END IF
               IF (FR1.NE.ZERO) THEN

                  FX1 = FR1*XD
                  FY1 = FR1*YD
                  FZ1 = FR1*ZD

                  SUMFX = SUMFX + FX1
                  FX(J) = FX(J) - FX1

                  SUMFY = SUMFY + FY1
                  FY(J) = FY(J) - FY1

                  SUMFZ = SUMFZ + FZ1 
                  FZ(J) = FZ(J) - FZ1

                  A11 = A11 + XD*FX1
                  A12 = A12 + XD*FY1
                  A13 = A13 + XD*FZ1
                  A22 = A22 + YD*FY1
                  A23 = A23 + YD*FZ1
                  A33 = A33 + ZD*FZ1
               END IF
  190 CONTINUE

            FX(I) = FX(I)+SUMFX 
            FY(I) = FY(I)+SUMFY
            FZ(I) = FZ(I)+SUMFZ

Next, I wish I could “translate” this OpenMP region into an !$acc region. I did manage to compile it, but it is very slow, and the results differ from the CPU-only code.
This will probably be the topic of another post.

Thanks a lot for your help
Best regards,

nico

Hi Nico,

It looks like you forgot to add “-fast” to the PGI OpenMP version, so the code is only minimally optimized. Can you try again with:

PGI openMP: pgfortran -Minfo -mp -fast -Mpreprocess -c -o FORCE.o2 FORCE.f

If that doesn’t make a difference, can you make the complete application, or at least a reproducing example, available?

Next, I wish I could “translate” this OpenMP region into an !$acc region. I did manage to compile it, but it is very slow, and the results differ from the CPU-only code.

Sure. Typically, slowdowns occur when data movement has not yet been addressed (via data regions or the mirror clause), when there isn’t enough parallelization (small workloads, small loop counts), and/or because of poor device memory access.

For data movement, look at the output from “-Minfo=accel” and note where the data is copied. If the data movement is inside a host loop, can it be hoisted out using a data region? If the accelerator compute region is inside a subroutine that’s called many times, can a data region be used in the caller routine and the device arrays be made “reflected”? If you are using module arrays in your compute region, can you use the “mirror” directive to allocate them on the device and then the “update” directive to better manage when the data is copied?
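As a rough illustration of the first point (the loop and array names here are made up, not taken from your code, and the syntax is from memory), hoisting the copies out of a host time-step loop with a data region looks something like this:

!$acc data region copyin(xq(1:n),yq(1:n)) copy(fx(1:n))
            DO istep = 1, nsteps
!$acc region
               DO i = 1, n
                  fx(i) = fx(i) + xq(i)*yq(i)
               END DO
!$acc end region
! Sync back to the host only when the host actually needs the data
!$acc update host(fx(1:n))
            END DO
!$acc end data region

The arrays now stay resident on the device for the whole time-step loop, and the “update” directive lets you control when copies happen instead of paying for them on every compute region.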

For parallelization, compile with the flag “-ta=nvidia,time” and review the profile output. What schedule is being used (i.e. the grid and block size)? Can you move the compute region to an outer loop (assuming it’s currently on an innermost one)? Which compute regions are taking the most time? How much time is spent copying data?
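If you need to steer the schedule yourself, you can put loop mapping directives on the individual loops. Roughly (illustrative loop bounds and arrays again, and from memory on the exact spelling):

!$acc region
!$acc do parallel
            DO i = 1, n
!$acc do vector(256)
               DO j = 1, m
                  a(j,i) = b(j,i) + c(j,i)
               END DO
            END DO
!$acc end region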

Finally, review the -Minfo=accel information for any “Non-stride 1” messages. This means that your data is non-contiguous and will lead to poor data access on the device. You may need to rearrange your data structure, your loops, or use accelerator loop directives to ensure that your “vector” index is the leftmost (column) dimension.
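A tiny example of the difference, assuming the J loop is the one that gets mapped to the vector dimension:

! Stride-1 (good): J indexes the leftmost dimension, so consecutive
! threads touch consecutive memory locations
            DO i = 1, n
               DO j = 1, m
                  a(j,i) = a(j,i) + 1.0
               END DO
            END DO

! Non-stride-1 (poor): J indexes the rightmost dimension
            DO i = 1, n
               DO j = 1, m
                  a(i,j) = a(i,j) + 1.0
               END DO
            END DO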

In looking at your code, it’s most likely a data movement problem. The “K” loop (190) is an inner loop, so for every iteration of the outer “II” loop (200) you are copying the data again. Putting a data region around the “II” loop should help a lot.
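In sketch form it would look roughly like the following. The clause lists are abbreviated, the “II” loop bound is a placeholder, and I haven’t dealt with the host-side FX(I)/FY(I)/FZ(I) updates between iterations, which would need “update” directives once the arrays live on the device:

!$acc data region copyin(XQ(1:MAXNAT),YQ(1:MAXNAT),ZQ(1:MAXNAT))
!$acc+ copyin(Q(1:MAXNAT),ITYPE(1:MAXNAT))
!$acc+ copy(FX(1:MAXNAT),FY(1:MAXNAT),FZ(1:MAXNAT))
            DO 200 II = 1,NMOL         ! NMOL is a placeholder for the real bound
!              ... per-II setup on the host ...
!$acc region
            DO 190 K = LSTART,NLIST(II)
!              ... the loop body you posted, unchanged ...
  190       CONTINUE
!$acc end region
!              ... per-II host updates of FX(I), FY(I), FZ(I), LSTART ...
  200       CONTINUE
!$acc end data region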

  • Mat

Hi Mat,
Thanks for your quick reply.

  • About the OpenMP issue:
    I added the missing “-fast -O2” flags
...
pgfortran -Minfo -fast -O2 -mp -Mpreprocess -I./ -DLJEXV  -c -o FORCE.o2 FORCE.f
pgfortran -Minfo -fast -O2 -mp  GMULQ.o2 ANGLE.o2   BOXCH.o2 CDTIME.o2 CONRIT.o2 CUTCOR.o2  FORCE.o2 HARMON.o2 INVERT.o2 MAPCEL.o2 NEWPOS.o2 RECIPR.o2 RITIND.o2 SETCUT.o2 SHAKE.o2 TORSIO.o2    OOP.o2 ADDNEI.o    BOX.o CHKDAT.o CONFIG.o DATALI.o DATRIT.o GETNUM.o GOOD.o   HEAD.o   HRIT.o  PRINT.o  PTRIT.o RDCONF.o RDPARM.o SETCH2.o SETCH3.o SETIND.o SETLIS.o SUMMOM.o TEMCAL.o  TENAV.o TENRIT.o TEXNUM.o TSCALE.o UPDATE.o WRTCON.o DATREA.o CHKCON.o -o gmq_pgi_openMP

but it makes no difference: still ~11 seconds, vs ~2.9 s for gfortran, both on 4 active CPUs (checked with top).

I would really like to send you the complete source code and a working dataset, since I guess that is the fastest and most efficient way to solve this.
But I’d rather send it privately. Would you please send me an e-mail or a private message, so that I can
give you access to a tar file? Of course, I will publish the solution and any further discussion on this forum.

  • About the Accelerator model:
    Again, thanks a lot for your advice. Maybe I should also move this discussion to the “Accelerator Programming” forum.
    Surely, one issue is data movement, as shown in the timings below:
administrateur@crunch:$  time ../src_stable_pgi/gmq_pgi_acc

real	6m17.970s
user	4m20.450s
sys	1m57.070s

Accelerator Kernel Timing data
/home/administrateur/Prog/gmq/src_stable_pgi/FORCE.f
  force
    451: region entered 190000 times
        time(us): total=377551710 init=68345 region=377483365
                  kernels=8941263 data=229441983
        w/o init: total=377483365 max=2694 min=1870 avg=1986
        457: kernel launched 189924 times
            grid: [1-2]  block: [256]
            time(us): total=7980185 max=55 min=36 avg=42
        562: kernel launched 189924 times
            grid: [12]  block: [256]
            time(us): total=961078 max=20 min=5 avg=5

Nevertheless, before addressing performance issues, I’d rather solve the correctness issue.
The results differ depending on whether I use the host CPU or the GPU accelerator (Device Name: GeForce GTX 560, Device Revision Number: 2.1, Global Memory Size: 1072889856, Number of Multiprocessors: 7, Number of Cores: 224).

Here are the naive accelerator directives I have used so far:

            SUMFX=0.0
            SUMFY=0.0
            SUMFZ=0.0

!$acc region copyin(ITYPE(1:MAXNAT), CUT2TY(1:MAXBND), F2EPST(1:MAXBND))
!$acc+ copyin(SIG2TY(1:MAXBND), FEPSTY(1:MAXBND), OFEPST(1:MAXBND))
!$acc+ copyin(XQ(1:MAXNAT),YQ(1:MAXNAT),ZQ(1:MAXNAT),FX(1:MAXNAT))
!$acc+ copyin(FY(1:MAXNAT), FZ(1:MAXNAT), Q(1:MAXNAT))
!$acc do independent

! It might not be obvious, but the loop iterations are independent by design

            DO 190 K = LSTART,NLIST(II)
               J = LIST(K)
               JTYP = ITYPE(J)
               XD = XQ(I) - XQ(J)
               YD = YQ(I) - YQ(J)
               ZD = ZQ(I) - ZQ(J)
               XDD = XD - INT(XD)*TWO
               YDD = YD - INT(YD)*TWO
               ZDD = ZD - INT(ZD)*TWO
               XD = H(1,1)*XDD + H(1,2)*YDD + H(1,3)*ZDD
               YD = H(2,1)*XDD + H(2,2)*YDD + H(2,3)*ZDD
               ZD = H(3,1)*XDD + H(3,2)*YDD + H(3,3)*ZDD
               RSQ = XD*XD + YD*YD + ZD*ZD
               FR1 = 0.0
               IJTYP = INBTYP(ITYP,JTYP)
               IF (RSQ.LT.CUT2TY(IJTYP)) THEN
                  R2 = SIG2TY(IJTYP)/RSQ
                  R6 = R2*R2*R2
                  R12 = R6*R6
                  ENVDW = ENVDW+FEPSTY(IJTYP)* (R12-R6) + OFEPST(IJTYP)
                  FR1 = F2EPST(IJTYP)*R2* (R12-HALF*R6)
                  VIRVDW = VIRVDW - FR1*RSQ
               END IF
               IF (LOGQ) THEN
                  IF (RSQ.LT.CUTQ2 .AND. Q(J).NE.ZERO) THEN
                     R = SQRT(RSQ)
                     ALR = ALPHA*R
                     EX = EXP(-ALR**2)
                     T = ONE/ (ONE+P*ALR)
                     EAL = ((((A5*T+A4)*T+A3)*T+A2)*T+A1)*T*EX
                     QIQJ = Q(I)*Q(J)*E2DFPE
                     REALSE = EAL*QIQJ/R
                     REALST = REALST + REALSE
                     FR1 = FR1 + (REALSE+ALPI*QIQJ*EX)/RSQ
                  END IF
               END IF
               IF (FR1.NE.ZERO) THEN

                  FX1 = FR1*XD
                  FY1 = FR1*YD
                  FZ1 = FR1*ZD
                  SUMFX = SUMFX + FX1
                  FX(J) = FX(J) - FX1
                  SUMFY = SUMFY + FY1
                  FY(J) = FY(J) - FY1
                  SUMFZ = SUMFZ + FZ1 
                  FZ(J) = FZ(J) - FZ1
                  A11 = A11 + XD*FX1
                  A12 = A12 + XD*FY1
                  A13 = A13 + XD*FZ1
                  A22 = A22 + YD*FY1
                  A23 = A23 + YD*FZ1
                  A33 = A33 + ZD*FZ1
               END IF
  190 CONTINUE
!$acc end region
C########################## nico
            FX(I) = FX(I)+SUMFX 
            FY(I) = FY(I)+SUMFY
            FZ(I) = FZ(I)+SUMFZ 
            LSTART = NLIST(II) + 1



pgfortran -Minfo -fast -O2 -Mpreprocess -I./ -DLJEXV  -c -o FORCE.o2 -ta=nvidia,time FORCE.f
force:
    176, Generated 3 alternate versions of the loop
         Generated vector sse code for the loop
    183, Loop not vectorized: data dependency
    204, Loop not vectorized/parallelized: too deeply nested
    213, Loop unrolled 4 times
    224, Loop unrolled 4 times
    232, Loop not vectorized/parallelized: contains call
    269, Invariant if transformation
    280, Loop unrolled 4 times
    414, Loop not vectorized/parallelized: contains call
    451, Generating copyin(fz(:maxnat))
         Generating copyin(fy(:maxnat))
         Generating copyin(fx(:maxnat))
         Generating copyin(inbtyp(ityp,:))
         Generating copyin(h(1:3,1:3))
         Generating copyin(zq(:maxnat))
         Generating copyin(yq(:maxnat))
         Generating copyin(xq(:maxnat))
         Generating copyin(itype(:maxnat))
         Generating copyin(list(lstart:nlist))
         Generating copyin(sig2ty(:maxbnd))
         Generating copyin(ofepst(:maxbnd))
         Generating copyin(fepsty(:maxbnd))
         Generating copyin(f2epst(:maxbnd))
         Generating copyin(cut2ty(:maxbnd))
         Generating copyin(q(:maxnat))
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
    457, Loop is parallelizable
         Accelerator kernel generated
        457, !$acc do parallel, vector(256) ! blockidx%x threadidx%x
             Cached references to size [3x3] block of 'h'
             CC 1.3 : 64 registers; 2068 shared, 336 constant, 0 local memory bytes; 25% occupancy
             CC 2.0 : 62 registers; 2052 shared, 348 constant, 0 local memory bytes; 33% occupancy
        488, Sum reduction generated for envdw
        526, Sum reduction generated for virvdw
        537, Sum reduction generated for realst
        549, Sum reduction generated for sumfx
        552, Sum reduction generated for sumfy
        555, Sum reduction generated for sumfz
        557, Sum reduction generated for a11
        558, Sum reduction generated for a12
        559, Sum reduction generated for a13
        560, Sum reduction generated for a22
        561, Sum reduction generated for a23
        562, Sum reduction generated for a33
pgfortran -Minfo -fast -O2  GMULQ.o2 ANGLE.o2   BOXCH.o2 CDTIME.o2 CONRIT.o2 CUTCOR.o2  FORCE.o2 HARMON.o2 INVERT.o2 MAPCEL.o2 NEWPOS.o2 RECIPR.o2 RITIND.o2 SETCUT.o2 SHAKE.o2 TORSIO.o2    OOP.o2 ADDNEI.o    BOX.o CHKDAT.o CONFIG.o DATALI.o DATRIT.o GETNUM.o GOOD.o   HEAD.o   HRIT.o  PRINT.o  PTRIT.o RDCONF.o RDPARM.o SETCH2.o SETCH3.o SETIND.o SETLIS.o SUMMOM.o TEMCAL.o  TENAV.o TENRIT.o TEXNUM.o TSCALE.o UPDATE.o WRTCON.o DATREA.o CHKCON.o -ta=nvidia,time -o gmq_pgi_acc

That’s a long post! Thank you for reading this far, and thank you for your help.

Hi Nico,

Would you please send me an e-mail

Done. Though, please edit your post to remove your email address; I’d rather you not receive spam. Note that in the future, feel free to send attachments to PGI Customer Support (trs@pgroup.com). While I don’t look at support mail on a daily basis, if you let them know it’s for me, they will pass the code along.

but it makes no difference: still ~11 seconds, vs ~2.9 s for gfortran, both on 4 active CPUs

Well, I guess they can’t all be this easy. I’ll take a look once I get your code.

I’d rather solve the correctness issue. The results differ depending on whether I use the host CPU or the GPU accelerator

Yes, getting wrong answers really fast doesn’t help much. One thing I notice is that you assign values to FX, FY, and FZ but are only copying them in. Shouldn’t you also copy this data back to the host (i.e. change copyin to copy)?

Note that wrong answers are often caused by incorrect data movement clauses. Hence, start by removing all the “copy” clauses, then add them back in one at a time until you get the wrong answer.
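In your case that probably just means the region header changes to something like this (only the FX/FY/FZ clause moves from copyin to copy; everything else stays as you posted it):

!$acc region copyin(ITYPE(1:MAXNAT), CUT2TY(1:MAXBND), F2EPST(1:MAXBND))
!$acc+ copyin(SIG2TY(1:MAXBND), FEPSTY(1:MAXBND), OFEPST(1:MAXBND))
!$acc+ copyin(XQ(1:MAXNAT),YQ(1:MAXNAT),ZQ(1:MAXNAT),Q(1:MAXNAT))
!$acc+ copy(FX(1:MAXNAT),FY(1:MAXNAT),FZ(1:MAXNAT))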

  • Mat

One thing I notice is that you assign values to FX, FY, and FZ but are only copying them in. Shouldn’t you also copy this data back to the host (i.e. change copyin to copy)?

Dang!
Of course I should! Once the FX, FY and FZ arrays have been computed on the GPU, I need to copy them back. This is a terrible mistake and I am quite ashamed. Changing ‘copyin’ to ‘copy’ for FX, FY and FZ works perfectly! The results from the CPU-only and Accelerator Model code are now identical.

(Actually, I had to add those clauses in the first place, otherwise the compiler complains about “Accelerator restriction: one or more arrays have unknown size”. But I should have been more careful about which arrays are inputs and which are outputs.)

Now, let’s try all the advice you gave to speed it up.
Thanks a lot, Mat!

As Mat pointed out in his email, the OpenMP performance problem comes from the schedule clause on the !$omp parallel do directive.
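Concretely, only the first directive line needs to change; the reduction/private/shared clauses stay exactly as in my first post:

!$omp parallel do schedule(static)

or simply “!$omp parallel do” with no schedule clause at all. Below are the timings I measured with the different schedules.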

uname -a: Linux crunch 2.6.35-32-generic #65-Ubuntu SMP Tue Jan 24 13:47:56 UTC 2012 x86_64 GNU/Linux
Intel® Core™ i7-2600 CPU @ 3.40GHz (4 cores)

gfortran (gcc version 4.4.5 )
schedule(auto): 2.88 s
schedule(dynamic): 6.42 s

pgfortran (pgfortran 11.10-0 64-bit target on x86-64 Linux -tp sandybridge)
without any schedule clause: 2.94 s
schedule(auto): 11.5 s
schedule(dynamic): 11.6 s
schedule(dynamic, 50): 3.14 s
schedule(guided): 4.914 s

That makes sense, since PGI’s “auto” defaults to the dynamic schedule. My times are as follows:

Schedule  Gfortran   PGI
auto          3.44   12.51
static        3.49    3.39
blank         3.44    3.44

I also compiled with “schedule(runtime)” and adjusted the schedule via the “OMP_SCHEDULE” environment variable. Note that runtime scheduling adds a bit of extra overhead.
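For anyone who wants to repeat this, the recipe is to compile the loop with a runtime schedule and then pick the actual schedule when launching the run (the executable name below is taken from Nico’s link line), e.g.:

!$omp parallel do schedule(runtime)

export OMP_SCHEDULE="dynamic,50"
time ./gmq_pgi_openMP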


Schedule   Gfortran     PGI
static        4.96      3.86
static,10     3.77      3.84
static,25     3.86      3.82
static,50     4.20      4.04
guided        4.61      5.26
guided,10     3.88      4.04
guided,25     3.89      3.93
guided,50     4.33      4.11
dynamic       6.92     13.33
dynamic,10    3.88      4.22
dynamic,25    3.77      4.10
dynamic,50    4.11      3.85

Note that given the short run time it’s hard to draw any conclusions, except that “auto/dynamic” happens to be the one outlier for this particular workload.

  • Mat