OpenACC equivalent of OpenMP accelerated code

Hi,

I have a code section that parallelizes just fine, where I want it to, and runs correctly with OpenMP, but not with OpenACC. Any advice?

OpenMP :

      !$omp parallel default(none) &
      !$omp&shared(NColor,indexL,itemL,indexU,itemU,AL,AU,D,ALU,perm,&
      !$omp&       NContact,indexCL,itemCL,indexCU,itemCU,CAL,CAU,&
      !$omp&       ZP,icToBlockIndex,blockIndexToColorIndex) &
      !$omp&private(SW1,SW2,SW3,X1,X2,X3,ic,i,iold,isL,ieL,isU,ieU,j,k,blockIndex)

    !C-- FORWARD
    do ic=1,NColor
    
      !$omp do schedule (static, 1)
      do blockIndex = icToBlockIndex(ic-1)+1, icToBlockIndex(ic)
        do i = blockIndexToColorIndex(blockIndex-1)+1, &
            blockIndexToColorIndex(blockIndex)
          ! do i = startPos(threadNum, ic), endPos(threadNum, ic)
          iold = perm(i)
          SW1= ZP(3*iold-2)
          SW2= ZP(3*iold-1)
          SW3= ZP(3*iold  )
          isL= indexL(i-1)+1
          ieL= indexL(i)
          do j= isL, ieL
            !k= perm(itemL(j))
            k= itemL(j)
            X1= ZP(3*k-2)
            X2= ZP(3*k-1)
            X3= ZP(3*k  )
            SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
            SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
            SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
          enddo ! j

          if (NContact.ne.0) then
            isL= indexCL(i-1)+1
            ieL= indexCL(i)
            do j= isL, ieL
              !k= perm(itemCL(j))
              k= itemCL(j)
              X1= ZP(3*k-2)
              X2= ZP(3*k-1)
              X3= ZP(3*k  )
              SW1= SW1 - CAL(9*j-8)*X1 - CAL(9*j-7)*X2 - CAL(9*j-6)*X3
              SW2= SW2 - CAL(9*j-5)*X1 - CAL(9*j-4)*X2 - CAL(9*j-3)*X3
              SW3= SW3 - CAL(9*j-2)*X1 - CAL(9*j-1)*X2 - CAL(9*j  )*X3
            enddo ! j
          endif

          X1= SW1
          X2= SW2
          X3= SW3
          X2= X2 - ALU(9*i-5)*X1
          X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
          X3= ALU(9*i  )*  X3
          X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
          X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2)
          ZP(3*iold-2)= X1
          ZP(3*iold-1)= X2
          ZP(3*iold  )= X3
        enddo ! i
      enddo ! blockIndex
    !$omp end do
    enddo ! ic
    
    ...
      !$omp end parallel

OpenACC :

    do ic=1,NColor
    
     !$acc parallel loop collapse(2)
      do blockIndex = icToBlockIndex(ic-1)+1, icToBlockIndex(ic)
        do i = blockIndexToColorIndex(blockIndex-1)+1, &
            blockIndexToColorIndex(blockIndex)
          ! do i = startPos(threadNum, ic), endPos(threadNum, ic)
          iold = perm(i)
          SW1= ZP(3*iold-2)
          SW2= ZP(3*iold-1)
          SW3= ZP(3*iold  )
          isL= indexL(i-1)+1
          ieL= indexL(i)

          !$acc loop vector
          do j= isL, ieL
            !k= perm(itemL(j))
            k= itemL(j)
            X1= ZP(3*k-2)
            X2= ZP(3*k-1)
            X3= ZP(3*k  )
            SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
            SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
            SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
          enddo ! j

          if (NContact.ne.0) then
            isL= indexCL(i-1)+1
            ieL= indexCL(i)

           !$acc loop vector
            do j= isL, ieL
              !k= perm(itemCL(j))
              k= itemCL(j)
              X1= ZP(3*k-2)
              X2= ZP(3*k-1)
              X3= ZP(3*k  )
              SW1= SW1 - CAL(9*j-8)*X1 - CAL(9*j-7)*X2 - CAL(9*j-6)*X3
              SW2= SW2 - CAL(9*j-5)*X1 - CAL(9*j-4)*X2 - CAL(9*j-3)*X3
              SW3= SW3 - CAL(9*j-2)*X1 - CAL(9*j-1)*X2 - CAL(9*j  )*X3
            enddo ! j
          endif

          X1= SW1
          X2= SW2
          X3= SW3
          X2= X2 - ALU(9*i-5)*X1
          X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
          X3= ALU(9*i  )*  X3
          X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
          X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2)
          ZP(3*iold-2)= X1
          ZP(3*iold-1)= X2
          ZP(3*iold  )= X3
        enddo ! i
      enddo ! blockIndex

     !$acc end parallel loop
    enddo ! ic

The OpenACC version fails after a varying number of iterations (typically 5-38); there appears to be a race condition. How can this race condition be avoided? It does not occur in the OpenMP version.

Thanks for any ideas!

Best regards,
Olav

Hi Olav,

Since the example is incomplete, it’s difficult to tell exactly why this would occur, but I’ll do my best with the information provided.

  1. What are the compiler feedback messages telling you? Please add the flag “-Minfo=accel” to your compile line to enable them.

  2. The two outer loops are not legal to collapse: the “i” loop bounds are taken from a look-up array indexed by the “blockIndex” loop, while the trip count of every loop associated with a collapse clause must be computable and invariant across all of those loops. One possible restructuring without collapse is sketched after this list.

  3. ZP does have a potential race condition depending on the “iold” values coming out of the “perm” array. If you can guarantee that there’s no overlap within a color, it should be fine, but without knowing the values in “perm” I can’t tell (the second sketch below shows a simple host-side check). Note that if there is an overlap, the same issue exists in OpenMP, but since OpenMP on the CPU uses far fewer threads than a GPU, the race may simply never be encountered there.

  4. I’m not seeing any data management in the code. Assuming you aren’t using a data directive higher up, the compiler will need to add the data movement implicitly, but since most of the arrays are accessed through computed indices, it has no way of knowing how much of each array to copy. The feedback messages should warn you if this is the case. For performance you’ll also want a data region that keeps the device copies of the arrays alive outside the “ic” loop, so the arrays aren’t transferred every time the parallel loop is encountered; the first sketch below includes one.
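
For points 2 and 4, here is a minimal sketch of one possible restructuring (untested, and only the non-contact arrays are shown; with NContact /= 0 you would add indexCL, itemCL, and CAL to the copyin list and the corresponding inner loop to the body). The collapse clause is dropped, the “blockIndex” loop gets the gangs, the “i” loop gets the vector lanes, and a data region hoists the transfers out of the “ic” loop. It assumes the arrays’ full extents are visible at this point (e.g. module or explicit-shape arrays):

    !$acc data copyin(icToBlockIndex, blockIndexToColorIndex, perm, &
    !$acc&            indexL, itemL, AL, ALU) &
    !$acc&     copy(ZP)
    do ic= 1, NColor

      !$acc parallel loop gang
      do blockIndex = icToBlockIndex(ic-1)+1, icToBlockIndex(ic)

        !$acc loop vector private(iold, isL, ieL, j, k, &
        !$acc&                    SW1, SW2, SW3, X1, X2, X3)
        do i = blockIndexToColorIndex(blockIndex-1)+1, &
            blockIndexToColorIndex(blockIndex)
          iold = perm(i)
          SW1= ZP(3*iold-2)
          SW2= ZP(3*iold-1)
          SW3= ZP(3*iold  )
          isL= indexL(i-1)+1
          ieL= indexL(i)

          ! j now runs sequentially within each vector lane, so the
          ! SW1/SW2/SW3 accumulations need no reduction clause
          do j= isL, ieL
            k= itemL(j)
            X1= ZP(3*k-2)
            X2= ZP(3*k-1)
            X3= ZP(3*k  )
            SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
            SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
            SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
          enddo ! j

          ! (the NContact block goes here, same pattern with
          !  indexCL/itemCL/CAL)

          X1= SW1
          X2= SW2
          X3= SW3
          X2= X2 - ALU(9*i-5)*X1
          X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
          X3= ALU(9*i  )*  X3
          X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
          X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2)
          ZP(3*iold-2)= X1
          ZP(3*iold-1)= X2
          ZP(3*iold  )= X3
        enddo ! i
      enddo ! blockIndex
      !$acc end parallel loop
    enddo ! ic
    !$acc end data

Two side notes on this layout: your original “!$acc loop vector” on the “j” loops has multiple vector lanes accumulating into SW1/SW2/SW3, which would need a reduction clause to be correct (check the -Minfo output to see whether the compiler detected one implicitly); running “j” sequentially per lane sidesteps that. Launching one parallel region per color also preserves the synchronization between colors that the OpenMP barrier at “end do” gives you.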
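
For point 3, a hypothetical host-side check along the following lines can tell you whether two iterations within the same color ever map to the same “iold”, i.e. whether the ZP writes can actually collide inside a single parallel region. The subroutine name and the “mark”/“nDup” locals are made up for illustration, and “N” is assumed to be the number of 3x3 block rows; if “perm” is a true permutation, this should never fire:

    subroutine check_perm_overlap(NColor, N, icToBlockIndex, &
                                  blockIndexToColorIndex, perm)
      implicit none
      integer, intent(in) :: NColor, N
      integer, intent(in) :: icToBlockIndex(0:*), blockIndexToColorIndex(0:*)
      integer, intent(in) :: perm(*)
      integer :: mark(N)
      integer :: ic, blockIndex, i, iold, nDup

      do ic= 1, NColor
        mark= 0              ! reset the scratch array for each color
        nDup= 0
        do blockIndex= icToBlockIndex(ic-1)+1, icToBlockIndex(ic)
          do i= blockIndexToColorIndex(blockIndex-1)+1, &
              blockIndexToColorIndex(blockIndex)
            iold= perm(i)
            if (mark(iold) /= 0) nDup= nDup + 1
            mark(iold)= 1
          enddo
        enddo
        if (nDup > 0) print *, 'color', ic, 'has', nDup, 'overlapping iold values'
      enddo
    end subroutine check_perm_overlap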

Hope this helps,
Mat