OpenACC "threadprivate"?

In OpenACC, is there an equivalent to OpenMP’s threadprivate directive? That is, is there a way to create loop-private data whose instance and value persist between acc loops?

I’d like to do something like this:

!$acc parallel num_gangs(zmax) vector_length(xmax*ymax)
!$acc loop gang
do k = 1, zmax
  !$acc loop vector collapse(2) private(temp)
  do j = 1, ymax
    do i = 1, xmax
       !$acc loop seq
       do n = 1, nmax
           temp = some complicated accumulation over n
       enddo
    enddo
  enddo
  ! Here, I'm ending the loops to do an implicit syncthreads
  !$acc loop vector collapse(2) private(temp)
  do j = 1, ymax
    do i = 1, xmax
      ! Here, I want each vector to use its previously-accumulated 
      ! instance of temp (not a new instance of temp)
    enddo
  enddo
enddo
!$acc end parallel

Hi Ron,

Sorry, no. In OpenACC, the private variable is only persistent within the loop on which it’s declared as private.

Here, you’ll need to manually privatize “temp”, either by making it a shared 3D array or by making it a 2D array private to the gang loop.

-Mat

I reached out to the OpenACC committee about this, and Michael Wolfe gave some really interesting responses.

Among other comments, he pointed out that there can be correctness issues with persistent threadprivate variables in a pragma-based approach. For example, the hypothetical example I showed here might not be correct if the pragmas were ignored.
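To make the concern concrete, here is a sketch (using the same loop structure as the hypothetical example above) of what happens when the directives are ignored and the loops run serially:

```fortran
! With the directives ignored, "temp" is one ordinary scalar shared by
! every (i,j) iteration, not a per-vector instance:
do j = 1, ymax
  do i = 1, xmax
    do n = 1, nmax
      temp = ...  ! each iteration overwrites the same scalar
    enddo
  enddo
enddo
do j = 1, ymax
  do i = 1, xmax
    ! temp now holds only the value from the last iteration
    ! (i = xmax, j = ymax), not the per-(i,j) value the
    ! accelerated version depends on
  enddo
enddo
```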

Hi Ron,

I saw Michael’s response as well. There are definitely challenges here, but you’re not the only one who has requested something similar, so the OpenACC standards committee is taking a serious look at what can be done.

To solve the immediate problem, I’m thinking you might try making “temp” a 2D private array at the gang level, or making it a 3D shared array. Either way, the code remains correct with or without OpenACC, though I’m not sure it will match the performance of your CUDA version. Keep in mind that OpenACC is fully interoperable with CUDA, so if you do find a case where your CUDA version is considerably faster, it’s easy to integrate it as part of a larger OpenACC-enabled application. Because OpenACC is generalized to target a wide variety of devices, certain CUDA coding practices can’t always be replicated.

For example, a 2D gang-private array:

real(kind=..), dimension(xmax,ymax) :: temp
...
!$acc parallel num_gangs(zmax) vector_length(xmax*ymax)
!$acc loop gang private(temp)
do k = 1, zmax 
  !$acc loop vector collapse(2) 
  do j = 1, ymax 
    do i = 1, xmax 
       !$acc loop seq 
       do n = 1, nmax 
           temp(i,j) = some complicated accumulation over n 
       enddo 
    enddo 
  enddo 
  ! Here, I'm ending the loops to do an implicit syncthreads
  ! Note: no private(temp) on this loop; each vector must see the
  ! gang-private copy accumulated above
  !$acc loop vector collapse(2)
  do j = 1, ymax 
    do i = 1, xmax 
       ... = temp(i,j)
       ! Each vector reads the value it accumulated above in
       ! its gang's copy of temp(i,j)
    enddo 
  enddo 
enddo 
!$acc end parallel

Or a 3D shared array:

real(kind=..), dimension(xmax,ymax,zmax) :: temp
...
!$acc parallel num_gangs(zmax) vector_length(xmax*ymax) &
!$acc create(temp)
!$acc loop gang 
do k = 1, zmax 
  !$acc loop vector collapse(2) 
  do j = 1, ymax 
    do i = 1, xmax 
       !$acc loop seq 
       do n = 1, nmax 
           temp(i,j,k) = some complicated accumulation over n 
       enddo 
    enddo 
  enddo 
  ! Here, I'm ending the loops to do an implicit syncthreads
  ! Note: no private(temp) on this loop; temp is the shared array
  !$acc loop vector collapse(2)
  do j = 1, ymax 
    do i = 1, xmax 
       ... = temp(i,j,k)
       ! Each vector reads the value it accumulated above in
       ! the shared array temp(i,j,k)
    enddo 
  enddo 
enddo 
!$acc end parallel
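As a side note on the CUDA interoperability Mat mentions: if a CUDA version of a kernel does win on performance, an OpenACC host_data region can hand device arrays to it directly. A minimal sketch, assuming a CUDA Fortran kernel named my_kernel (the kernel name, interface, and launch configuration are illustrative, not part of the discussion above):

```fortran
real(kind=8), dimension(xmax,ymax,zmax) :: temp
!$acc data create(temp)
! ... OpenACC compute regions fill temp on the device ...
!$acc host_data use_device(temp)
! Inside host_data, "temp" refers to the device copy, so it can be
! passed straight to a CUDA Fortran kernel:
call my_kernel<<<grid, block>>>(temp)
!$acc end host_data
!$acc end data
```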