Hello,
I am currently trying to port our “gmq” molecular dynamics code from gfortran 4.4.5 + OpenMP to the PGI Accelerator model.
Although I succeeded in compiling it with PGI 11.10, the OpenMP performance is not satisfactory.
The main subroutine is in FORCE.f.
Here are the compile command lines (some “defines” have been removed):
gfortran scalar : gfortran -O2 -c -o FORCE.o2 FORCE.f
gfortran openMP: gfortran -O2 -fopenmp -c -o FORCE.o2 FORCE.f
PGI scalar: pgfortran -Minfo -fast -O2 -Mpreprocess -c -o FORCE.o2 FORCE.f
PGI openMP: pgfortran -Minfo -mp -Mpreprocess -c -o FORCE.o2 FORCE.f
As explained in the man pages, ‘-mp’ is also used in the linking phase.
Fortunately, all 4 versions give the same results, but the timings differ:
gfortran-scalar : 4.06 s
pgi-scalar : 3.727 s (that’s fine)
gfortran-openMP : 2.93 s (on 4 CPUs, about x1.4 speedup over gfortran-scalar. Modest, but that’s not the point right now)
pgi-openMP : 11.453 s (x0.32 relative to pgi-scalar !!!???)
I am probably missing something, aren’t I?
Here is the code section (full source code can be found here: http://nicolas.charvin.free.fr/gmq/FORCE-prepro.f.gz)
(note: I only wish to port the code, and unfortunately, I still do not fully understand it :)
      SUMFX=0.0
      SUMFY=0.0
      SUMFZ=0.0
!============== NC : this is the inner-most intensive loop
!$omp parallel do schedule(auto)
!$omp+ reduction(+:ENVDW,VIRVDW,A11,A12,A13,A22,A23,A33)
!$omp+ reduction(+:REALST,SUMFX,SUMFY,SUMFZ)
!$omp+ private(K,J,JTYP,XD,YD,ZD,XDD,YDD,ZDD,RSQ,FR1,IJTYP)
!$omp+ private(R2,R6,R12,R,ALR,EX,T,EAL,QIQJ,REALSE)
!$omp+ private(FX1,FY1,FZ1)
!$omp+ shared(FX,FY,FZ)
      DO 190 K = LSTART,NLIST(II)
        J = LIST(K)
        JTYP = ITYPE(J)
        XD = XQ(I) - XQ(J)
        YD = YQ(I) - YQ(J)
        ZD = ZQ(I) - ZQ(J)
        XDD = XD - INT(XD)*TWO
        YDD = YD - INT(YD)*TWO
        ZDD = ZD - INT(ZD)*TWO
        XD = H(1,1)*XDD + H(1,2)*YDD + H(1,3)*ZDD
        YD = H(2,1)*XDD + H(2,2)*YDD + H(2,3)*ZDD
        ZD = H(3,1)*XDD + H(3,2)*YDD + H(3,3)*ZDD
        RSQ = XD*XD + YD*YD + ZD*ZD
        FR1 = 0.0
        IJTYP = INBTYP(ITYP,JTYP)
        IF (RSQ.LT.CUT2TY(IJTYP)) THEN
          R2 = SIG2TY(IJTYP)/RSQ
          R6 = R2*R2*R2
          R12 = R6*R6
          ENVDW = ENVDW+FEPSTY(IJTYP)* (R12-R6) + OFEPST(IJTYP)
          FR1 = F2EPST(IJTYP)*R2* (R12-HALF*R6)
          VIRVDW = VIRVDW - FR1*RSQ
        END IF
        IF (LOGQ) THEN
          IF (RSQ.LT.CUTQ2 .AND. Q(J).NE.ZERO) THEN
            R = SQRT(RSQ)
            ALR = ALPHA*R
            EX = EXP(-ALR**2)
            T = ONE/ (ONE+P*ALR)
            EAL = ((((A5*T+A4)*T+A3)*T+A2)*T+A1)*T*EX
            QIQJ = Q(I)*Q(J)*E2DFPE
            REALSE = EAL*QIQJ/R
            REALST = REALST + REALSE
            FR1 = FR1 + (REALSE+ALPI*QIQJ*EX)/RSQ
          END IF
        END IF
        IF (FR1.NE.ZERO) THEN
          FX1 = FR1*XD
          FY1 = FR1*YD
          FZ1 = FR1*ZD
          SUMFX = SUMFX + FX1
          FX(J) = FX(J) - FX1
          SUMFY = SUMFY + FY1
          FY(J) = FY(J) - FY1
          SUMFZ = SUMFZ + FZ1
          FZ(J) = FZ(J) - FZ1
          A11 = A11 + XD*FX1
          A12 = A12 + XD*FY1
          A13 = A13 + XD*FZ1
          A22 = A22 + YD*FY1
          A23 = A23 + YD*FZ1
          A33 = A33 + ZD*FZ1
        END IF
  190 CONTINUE
      FX(I) = FX(I)+SUMFX
      FY(I) = FY(I)+SUMFY
      FZ(I) = FZ(I)+SUMFZ
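In case it helps to reproduce the behaviour without the full code, the loop pattern above (a scalar reduction plus an indexed update of a shared array) can be sketched as a small standalone program. This is only an illustration and is not taken from gmq: the program name, the array size, the trivial neighbour list and the simplified pair term are all assumptions, and I have not checked whether it shows the same slowdown. It is written in free form, so it would need a free-form extension (for example .f90) and the OpenMP flag of the compiler (-fopenmp for gfortran, -mp for pgfortran).

```fortran
! Minimal sketch, NOT taken from gmq: names, sizes and the simplified
! pair term are assumptions for illustration only.
program omp_repro
  use omp_lib
  implicit none
  integer, parameter :: n = 200000      ! hypothetical neighbour count
  integer :: k, j
  integer, allocatable :: list(:)
  double precision, allocatable :: xq(:), fx(:)
  double precision :: xd, fx1, sumfx, xi, t0, t1

  allocate(list(n), xq(n), fx(n))
  do k = 1, n
     list(k) = k                        ! each J appears once: no write conflict
     xq(k) = dble(k) / dble(n)          ! positions in (0,1]
     fx(k) = 0.0d0
  end do
  xi = 2.0d0                            ! position of the "central" atom I
  sumfx = 0.0d0

  t0 = omp_get_wtime()
  !$omp parallel do schedule(auto) reduction(+:sumfx) &
  !$omp private(k, j, xd, fx1) shared(fx, xq, list, xi)
  do k = 1, n
     j = list(k)
     xd = xi - xq(j)                    ! always in [1,2), never zero
     fx1 = 1.0d0 / (xd*xd)              ! stand-in for the real pair force
     sumfx = sumfx + fx1                ! reduction, as SUMFX above
     fx(j) = fx(j) - fx1                ! indexed update, as FX(J) above
  end do
  !$omp end parallel do
  t1 = omp_get_wtime()

  print *, 'threads   =', omp_get_max_threads()
  print *, 'sumfx     =', sumfx
  print *, 'sum check =', sumfx + sum(fx)   ! should be close to zero
  print *, 'seconds   =', t1 - t0
end program omp_repro
```

The "sum check" line is only a sanity test: every contribution added to sumfx is also subtracted from one element of fx, so the two sums should cancel up to rounding.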
Next, I would like to “translate” this OpenMP region into an !$acc region. I have actually managed to compile it, but it is very slow, and the results differ from the CPU-only code.
This will probably be the topic of another post.
Thanks a lot for your help.
Best regards,
nico