OpenACC equivalent of OpenMP accelerated code

Hi,

I have a code section that parallelizes just fine, where I want it to, and runs correctly with OpenMP, but not with OpenACC. Any advice?

OpenMP :

      !$omp parallel default(none) &
      !$omp&shared(NColor,indexL,itemL,indexU,itemU,AL,AU,D,ALU,perm,&
      !$omp&       NContact,indexCL,itemCL,indexCU,itemCU,CAL,CAU,&
      !$omp&       ZP,icToBlockIndex,blockIndexToColorIndex) &
      !$omp&private(SW1,SW2,SW3,X1,X2,X3,ic,i,iold,isL,ieL,isU,ieU,j,k,blockIndex)

    !C-- FORWARD
    do ic=1,NColor
    
      !$omp do schedule (static, 1)
      do blockIndex = icToBlockIndex(ic-1)+1, icToBlockIndex(ic)
        do i = blockIndexToColorIndex(blockIndex-1)+1, &
            blockIndexToColorIndex(blockIndex)
          ! do i = startPos(threadNum, ic), endPos(threadNum, ic)
          iold = perm(i)
          SW1= ZP(3*iold-2)
          SW2= ZP(3*iold-1)
          SW3= ZP(3*iold  )
          isL= indexL(i-1)+1
          ieL= indexL(i)
          do j= isL, ieL
            !k= perm(itemL(j))
            k= itemL(j)
            X1= ZP(3*k-2)
            X2= ZP(3*k-1)
            X3= ZP(3*k  )
            SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
            SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
            SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
          enddo ! j

          if (NContact.ne.0) then
            isL= indexCL(i-1)+1
            ieL= indexCL(i)
            do j= isL, ieL
              !k= perm(itemCL(j))
              k= itemCL(j)
              X1= ZP(3*k-2)
              X2= ZP(3*k-1)
              X3= ZP(3*k  )
              SW1= SW1 - CAL(9*j-8)*X1 - CAL(9*j-7)*X2 - CAL(9*j-6)*X3
              SW2= SW2 - CAL(9*j-5)*X1 - CAL(9*j-4)*X2 - CAL(9*j-3)*X3
              SW3= SW3 - CAL(9*j-2)*X1 - CAL(9*j-1)*X2 - CAL(9*j  )*X3
            enddo ! j
          endif

          X1= SW1
          X2= SW2
          X3= SW3
          X2= X2 - ALU(9*i-5)*X1
          X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
          X3= ALU(9*i  )*  X3
          X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
          X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2)
          ZP(3*iold-2)= X1
          ZP(3*iold-1)= X2
          ZP(3*iold  )= X3
        enddo ! i
      enddo ! blockIndex
    !$omp end do
    enddo ! ic
    
    ...
      !$omp end parallel

OpenACC :

    do ic=1,NColor
    
     !$acc parallel loop collapse(2)
      do blockIndex = icToBlockIndex(ic-1)+1, icToBlockIndex(ic)
        do i = blockIndexToColorIndex(blockIndex-1)+1, &
            blockIndexToColorIndex(blockIndex)
          ! do i = startPos(threadNum, ic), endPos(threadNum, ic)
          iold = perm(i)
          SW1= ZP(3*iold-2)
          SW2= ZP(3*iold-1)
          SW3= ZP(3*iold  )
          isL= indexL(i-1)+1
          ieL= indexL(i)

          !$acc loop vector
          do j= isL, ieL
            !k= perm(itemL(j))
            k= itemL(j)
            X1= ZP(3*k-2)
            X2= ZP(3*k-1)
            X3= ZP(3*k  )
            SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
            SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
            SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
          enddo ! j

          if (NContact.ne.0) then
            isL= indexCL(i-1)+1
            ieL= indexCL(i)

           !$acc loop vector
            do j= isL, ieL
              !k= perm(itemCL(j))
              k= itemCL(j)
              X1= ZP(3*k-2)
              X2= ZP(3*k-1)
              X3= ZP(3*k  )
              SW1= SW1 - CAL(9*j-8)*X1 - CAL(9*j-7)*X2 - CAL(9*j-6)*X3
              SW2= SW2 - CAL(9*j-5)*X1 - CAL(9*j-4)*X2 - CAL(9*j-3)*X3
              SW3= SW3 - CAL(9*j-2)*X1 - CAL(9*j-1)*X2 - CAL(9*j  )*X3
            enddo ! j
          endif

          X1= SW1
          X2= SW2
          X3= SW3
          X2= X2 - ALU(9*i-5)*X1
          X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
          X3= ALU(9*i  )*  X3
          X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
          X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2)
          ZP(3*iold-2)= X1
          ZP(3*iold-1)= X2
          ZP(3*iold  )= X3
        enddo ! i
      enddo ! blockIndex

     !$acc end parallel loop
    enddo ! ic

The OpenACC version fails after a varying number of iterations (typically 5-38); there appears to be a race condition. How can this race condition be avoided? It does not occur in the OpenMP version.

Thanks for any ideas!

Best regards,
Olav

Hi Olav,

Since the example is incomplete, it’s difficult to tell exactly why this would occur, but I’ll do my best with the information provided.

  1. What are the compiler feedback messages telling you? Please add the flag “-Minfo=accel” to your compile line to enable them.

  2. The two outer loops are not legal to collapse: the “i” loop bounds are taken from a look-up array indexed by the “blockIndex” loop, while the trip count of every loop associated with a collapse clause must be computable and invariant across all of those loops. One possible restructuring without collapse is sketched after this list.

  3. ZP does have a potential race condition depending on the “iold” values coming out of the “perm” array. If you can guarantee that there’s no overlap within a color, it should be fine, but without knowing the values in “perm” I can’t tell (the second sketch below shows a simple host-side check). Note that if there is an overlap, the same issue exists in OpenMP, but since OpenMP on the CPU uses far fewer threads than a GPU, the race may simply never be encountered there.

  4. I’m not seeing any data management in the code. Assuming you aren’t using a data directive higher up, the compiler will need to add the data movement implicitly, but since most of the arrays are accessed through computed indices, it has no way of knowing how much of each array to copy. The feedback messages should warn you if this is the case. For performance you’ll also want a data region that keeps the device copies of the arrays alive outside the “ic” loop, so the arrays aren’t transferred every time the parallel loop is encountered; the first sketch below includes one.
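
For points 2 and 4, here is a minimal sketch of one possible restructuring (untested, and only the non-contact arrays are shown; with NContact /= 0 you would add indexCL, itemCL, and CAL to the copyin list and the corresponding inner loop to the body). The collapse clause is dropped, the “blockIndex” loop gets the gangs, the “i” loop gets the vector lanes, and a data region hoists the transfers out of the “ic” loop. It assumes the arrays’ full extents are visible at this point (e.g. module or explicit-shape arrays):

    !$acc data copyin(icToBlockIndex, blockIndexToColorIndex, perm, &
    !$acc&            indexL, itemL, AL, ALU) &
    !$acc&     copy(ZP)
    do ic= 1, NColor

      !$acc parallel loop gang
      do blockIndex = icToBlockIndex(ic-1)+1, icToBlockIndex(ic)

        !$acc loop vector private(iold, isL, ieL, j, k, &
        !$acc&                    SW1, SW2, SW3, X1, X2, X3)
        do i = blockIndexToColorIndex(blockIndex-1)+1, &
            blockIndexToColorIndex(blockIndex)
          iold = perm(i)
          SW1= ZP(3*iold-2)
          SW2= ZP(3*iold-1)
          SW3= ZP(3*iold  )
          isL= indexL(i-1)+1
          ieL= indexL(i)

          ! j now runs sequentially within each vector lane, so the
          ! SW1/SW2/SW3 accumulations need no reduction clause
          do j= isL, ieL
            k= itemL(j)
            X1= ZP(3*k-2)
            X2= ZP(3*k-1)
            X3= ZP(3*k  )
            SW1= SW1 - AL(9*j-8)*X1 - AL(9*j-7)*X2 - AL(9*j-6)*X3
            SW2= SW2 - AL(9*j-5)*X1 - AL(9*j-4)*X2 - AL(9*j-3)*X3
            SW3= SW3 - AL(9*j-2)*X1 - AL(9*j-1)*X2 - AL(9*j  )*X3
          enddo ! j

          ! (the NContact block goes here, same pattern with
          !  indexCL/itemCL/CAL)

          X1= SW1
          X2= SW2
          X3= SW3
          X2= X2 - ALU(9*i-5)*X1
          X3= X3 - ALU(9*i-2)*X1 - ALU(9*i-1)*X2
          X3= ALU(9*i  )*  X3
          X2= ALU(9*i-4)*( X2 - ALU(9*i-3)*X3 )
          X1= ALU(9*i-8)*( X1 - ALU(9*i-6)*X3 - ALU(9*i-7)*X2)
          ZP(3*iold-2)= X1
          ZP(3*iold-1)= X2
          ZP(3*iold  )= X3
        enddo ! i
      enddo ! blockIndex
      !$acc end parallel loop
    enddo ! ic
    !$acc end data

Two side notes on this layout: your original “!$acc loop vector” on the “j” loops has multiple vector lanes accumulating into SW1/SW2/SW3, which would need a reduction clause to be correct (check the -Minfo output to see whether the compiler detected one implicitly); running “j” sequentially per lane sidesteps that. Launching one parallel region per color also preserves the synchronization between colors that the OpenMP barrier at “end do” gives you.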
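
For point 3, a hypothetical host-side check along the following lines can tell you whether two iterations within the same color ever map to the same “iold”, i.e. whether the ZP writes can actually collide inside a single parallel region. The subroutine name and the “mark”/“nDup” locals are made up for illustration, and “N” is assumed to be the number of 3x3 block rows; if “perm” is a true permutation, this should never fire:

    subroutine check_perm_overlap(NColor, N, icToBlockIndex, &
                                  blockIndexToColorIndex, perm)
      implicit none
      integer, intent(in) :: NColor, N
      integer, intent(in) :: icToBlockIndex(0:*), blockIndexToColorIndex(0:*)
      integer, intent(in) :: perm(*)
      integer :: mark(N)
      integer :: ic, blockIndex, i, iold, nDup

      do ic= 1, NColor
        mark= 0              ! reset the scratch array for each color
        nDup= 0
        do blockIndex= icToBlockIndex(ic-1)+1, icToBlockIndex(ic)
          do i= blockIndexToColorIndex(blockIndex-1)+1, &
              blockIndexToColorIndex(blockIndex)
            iold= perm(i)
            if (mark(iold) /= 0) nDup= nDup + 1
            mark(iold)= 1
          enddo
        enddo
        if (nDup > 0) print *, 'color', ic, 'has', nDup, 'overlapping iold values'
      enddo
    end subroutine check_perm_overlap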

Hope this helps,
Mat