Finding Minimum Values using Parallel Loops

Hi all,

I have a double-nested loop in Fortran that computes some results from values stored in an array and stores the minimum values.

I was wondering if there was any way of using GPU parallelisation to speed this up? PGPROF reports a compute intensity of 4.17 and it is a major chunk of run-time for my program.

I tried splitting the loop so that the results are stored in temporary arrays (which are then searched for the minimum value), moving the IF statements outside the main loop to remove the scalar dependency. However, the compiler then privatised these arrays, which prevented parallelisation.

Is there a better way to go about this, or is this a situation not suited to parallelisation because of the need to store all the results?

Chris


      DO 200 KWALL  = KS,KE,1
      KM1 = KWALL-1
      IF(KM1.LT.1)    KM1 = 1
      KP1 = KWALL
      IF(KP1.GT.KMM1) KP1 = KMM1
      DO 200 JWALL  = JS,JE,1
      JM1 = JWALL-1
      IF(JM1.LT.1)    JM1 = 1
      JP1 = JWALL
      IF(JP1.GT.JMM1) JP1 = JMM1
!
!      FIRST THE I = 1 WALL
!
      FSOLID = 1.0 -0.25*(MWALLI1(JM1,KM1,NBLCK)+MWALLI1(JP1,KP1,NBLCK) &
                        + MWALLI1(JM1,KP1,NBLCK)+MWALLI1(JP1,KM1,NBLCK))
      FSOLID = FSOLID*I1_SHEAR(NBLCK)
      XD  = X(1,JWALL,KWALL,NBLCK)  - XP
      RD  = R(1,JWALL,KWALL,NBLCK)  - RP
      RTD = RT(1,JWALL,KWALL,NBLCK) - RTP
      DISTSQ = XD*XD + RD*RD + RTD*RTD
      DISTSQ = FSOLID*DISTSQ + (1.-FSOLID)*DLMINSQ
      IF(DISTSQ.LT.DMINSQ) THEN
      DMINSQ = DISTSQ
      IMIN  = 1
      JMIN  = JWALL
      KMIN  = KWALL
      XDMIN = XD
      RDMIN = RD
      RTDMIN= RTD
      IF_FOUND = 1
      ENDIF
!
!     NEXT THE I = IM WALL.
!
      FSOLID = 1.0 -0.25*(MWALLIM(JM1,KM1,NBLCK)+MWALLIM(JP1,KP1,NBLCK) &
                        + MWALLIM(JM1,KP1,NBLCK)+MWALLIM(JP1,KM1,NBLCK))
      FSOLID = FSOLID*IM_SHEAR(NBLCK)
      XD  = X(IM,JWALL,KWALL,NBLCK)  - XP
      RD  = R(IM,JWALL,KWALL,NBLCK)  - RP
      RTD = RT(IM,JWALL,KWALL,NBLCK) - RTP
      DISTSQ = XD*XD + RD*RD + RTD*RTD
      DISTSQ = FSOLID*DISTSQ + (1.-FSOLID)*DLMINSQ
      IF(DISTSQ.LT.DMINSQ) THEN
      DMINSQ = DISTSQ
      IMIN  = IM
      JMIN  = JWALL
      KMIN  = KWALL
      XDMIN = XD
      RDMIN = RD
      RTDMIN= RTD
      IF_FOUND = 1
      ENDIF
!
  200 CONTINUE

Hi Chris,

"Is there a better way to go about this, or is it a situation not geared towards parallelisation due to the need to store all the results."

You should be able to parallelize this code. Whether you achieve a speed-up may be in question, but at least with OpenACC it’s not much work to find out.

The problem with this code is the min reduction. If you were just looking for the min of each of these values (independent of the others), then you could just use an OpenACC reduction. However, since you want to keep track of the other values at the point where DISTSQ is at its min, you’ll need to store all the values (i.e. manually privatize them by turning them into temp arrays) and perform the min reduction afterwards, either on the host, which means more data movement, or in a sequential kernel on the device.
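As a rough illustration of the temp-array approach, here is a minimal sketch. The array sizes, the test data and the program name are made up so that it runs on its own, and the FSOLID weighting and the two separate walls from your loop are left out for brevity. It stores only the squared distance per (J,K) point and recomputes the offsets from the winning indices afterwards; storing XD, RD and RTD in their own temp arrays would work the same way. Without an OpenACC compiler the directives are treated as comments and the loop simply runs serially.

```fortran
program min_with_location
  implicit none
  ! Hypothetical sizes and data, chosen only so the sketch runs on its own.
  integer, parameter :: js = 1, je = 64, ks = 1, ke = 48
  real    :: x(js:je,ks:ke), r(js:je,ks:ke), rt(js:je,ks:ke)
  real    :: distsq(js:je,ks:ke)   ! temp array: one slot per (J,K) iteration
  real    :: xp, rp, rtp, xd, rd, rtd, dminsq
  integer :: j, k, jmin, kmin

  do k = ks, ke
    do j = js, je
      x(j,k)  = 0.1*real(j)
      r(j,k)  = 0.2*real(k)
      rt(j,k) = 0.05*real(j + k)
    end do
  end do
  xp  = 2.0
  rp  = 3.0
  rtp = 1.0

  ! Step 1: each iteration writes only its own element of distsq, so no
  ! iteration depends on another and both loops can run in parallel.
  !$acc parallel loop collapse(2) copyin(x, r, rt) copyout(distsq) &
  !$acc private(xd, rd, rtd)
  do k = ks, ke
    do j = js, je
      xd  = x(j,k)  - xp
      rd  = r(j,k)  - rp
      rtd = rt(j,k) - rtp
      distsq(j,k) = xd*xd + rd*rd + rtd*rtd
    end do
  end do

  ! Step 2: sequential min search on the host, tracking the location.
  dminsq = huge(dminsq)
  jmin = js
  kmin = ks
  do k = ks, ke
    do j = js, je
      if (distsq(j,k) < dminsq) then
        dminsq = distsq(j,k)
        jmin = j
        kmin = k
      end if
    end do
  end do

  ! Sanity check against the intrinsic (valid here because js = ks = 1).
  if (any(minloc(distsq) /= [jmin, kmin])) error stop 'location mismatch'

  print *, 'DMINSQ =', dminsq, ' at J =', jmin, ' K =', kmin
  ! The offsets at the minimum are recomputed from the winning indices.
  print *, 'XDMIN  =', x(jmin,kmin) - xp
end program min_with_location
```

Whether this beats the original serial loop depends on how expensive the copy of the temp array back to the host is compared with the compute, so it is worth profiling both variants.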

"...but this resulted in privatisation of these arrays preventing parallelisation."

I’m guessing this is because you only privatized on the KWALL index but are parallelizing both the KWALL and JWALL loops. In this case, you’ll need to either add a second dimension to the temp arrays for the JWALL index or just parallelize the outer KWALL loop.
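The second option (parallelizing only the outer KWALL loop) might look something like the sketch below. Again, the sizes, data and names such as dmin_k and jmin_k are invented for illustration, and the FSOLID weighting is omitted. Each K iteration runs its J loop sequentially and writes one candidate into temp arrays indexed by K only, so the final search has just KE-KS+1 entries to scan.

```fortran
program min_outer_loop_only
  implicit none
  ! Hypothetical sizes and data, chosen only so the sketch runs on its own.
  integer, parameter :: js = 1, je = 64, ks = 1, ke = 48
  real    :: x(js:je,ks:ke), r(js:je,ks:ke), rt(js:je,ks:ke)
  real    :: dmin_k(ks:ke)      ! best squared distance found in each K row
  integer :: jmin_k(ks:ke)      ! J index of that best value
  real    :: xp, rp, rtp, xd, rd, rtd, distsq, best, dminsq
  integer :: j, k, jbest, jmin, kmin

  do k = ks, ke
    do j = js, je
      x(j,k)  = 0.1*real(j)
      r(j,k)  = 0.2*real(k)
      rt(j,k) = 0.05*real(j + k)
    end do
  end do
  xp  = 2.0
  rp  = 3.0
  rtp = 1.0

  ! Only the K loop is parallel. The temp arrays are indexed by K, so each
  ! parallel iteration owns exactly one element of dmin_k and jmin_k.
  !$acc parallel loop copyin(x, r, rt) copyout(dmin_k, jmin_k) &
  !$acc private(best, jbest, xd, rd, rtd, distsq)
  do k = ks, ke
    best  = huge(best)
    jbest = js
    !$acc loop seq
    do j = js, je
      xd  = x(j,k)  - xp
      rd  = r(j,k)  - rp
      rtd = rt(j,k) - rtp
      distsq = xd*xd + rd*rd + rtd*rtd
      if (distsq < best) then
        best  = distsq
        jbest = j
      end if
    end do
    dmin_k(k) = best
    jmin_k(k) = jbest
  end do

  ! Final sequential reduction over one candidate per K row.
  dminsq = huge(dminsq)
  jmin = js
  kmin = ks
  do k = ks, ke
    if (dmin_k(k) < dminsq) then
      dminsq = dmin_k(k)
      jmin = jmin_k(k)
      kmin = k
    end if
  end do

  ! Sanity check against a whole-array evaluation (tolerance for rounding).
  if (abs(dminsq - minval((x-xp)**2 + (r-rp)**2 + (rt-rtp)**2)) > 1.0e-5) &
    error stop 'minimum mismatch'

  print *, 'DMINSQ =', dminsq, ' at J =', jmin, ' K =', kmin
end program min_outer_loop_only
```

This exposes less parallelism than collapsing both loops, but it moves far less data back to the host, so which variant wins will depend on the loop bounds.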

Hope this helps,
Mat