Nvfortran: How to create a private data structure with allocatable arguments?

I searched online but didn’t find a discussion of this, so I’m asking here: how do I create a data structure with one or more allocatable members that is private to each gang, using OpenACC directives? Here is a minimal example (in a file test.F90) of the code I’d like to run on the GPU:

module test_module
implicit none
!
type test_type
  integer                            :: n1, m2,n2, n3
  integer, pointer, dimension(:,:,:) :: dat
end type test_type
!
contains
  subroutine copydat(datain,dataout,m2,n2,m2out)
    !$acc routine vector
    implicit none
    type(test_type), intent(in)    :: datain
    type(test_type), intent(inout) :: dataout
    integer,         intent(in)    :: m2,n2,m2out
    integer                        :: i1,i2,i3, n1,n3
    !
    n1 = datain%n1; n3 = datain%n3
    do i3 = 1,n3
      do i2 = m2,n2
        !$acc loop vector
        do i1 = 1,n1
          dataout%dat(i1,i2-m2+m2out,i3) = datain%dat(i1,i2,i3)
        end do
      end do
    end do
  end subroutine copydat
  !
  subroutine computedat(data0,m2,n2)
    !$acc routine vector
    implicit none
    type(test_type), intent(inout) :: data0
    integer,         intent(in)    :: m2,n2
    integer                        :: i1,i2,i3
    integer                        :: n1, n3
    !
    n1 = data0%n1; n3 = data0%n3
    do i3 = 1,n3
      do i2 = m2,n2
        !$acc loop vector
        do i1 = 1,n1
          data0%dat(i1,i2,i3) = data0%dat(i1,i2,i3) + 15
        end do
      end do
    end do
  end subroutine computedat
  !
  subroutine get_indices(n2,i_chunk,n_chunks,m2c,n2c)
    !$acc routine seq
    implicit none
    integer, intent(in)  :: n2, i_chunk, n_chunks
    integer, intent(out) :: m2c,n2c
    !
    m2c = 1 + (i_chunk - 1)*n2/n_chunks
    n2c = i_chunk*n2/n_chunks
  end subroutine get_indices
  !
  subroutine deldat(data0)
    implicit none
    type(test_type), intent(inout)  :: data0
    !
    !$acc exit data delete(data0%dat)
    !$acc exit data delete(data0)
    deallocate(data0%dat)
  end subroutine deldat
  !
  subroutine makedat(data0,n1,m2,n2,n3)
    implicit none
    type(test_type), intent(inout) :: data0
    integer, intent(in)            :: n1, m2,n2, n3
    !
    data0%m2 = m2; data0%n2 = n2
    data0%n1 = n1; data0%n3 = n3
    allocate(data0%dat(n1,m2:n2,n3))
    !$acc enter data copyin(data0)
    !$acc enter data create(data0%dat)
  end subroutine makedat
end module test_module

program main
use test_module
use openacc
implicit none
integer         :: n_chunks, i_chunk, n1, n2, n3, i1, i2, i3, m2c,n2c
type(test_type) :: data1, data2, data3

open(10,file='num.txt')
read(10,*) n1
read(10,*) n2
read(10,*) n3
read(10,*) n_chunks
close(10)

write(*,'(4(A,I3),A)') 'Data of size ', n1 , ', ', n2, ', ', n3, ', split into ', n_chunks, ' chunks along dim2'

call makedat(data1, n1, 1,n2, n3)
do i3 = 1,n3
  do i2 = 1,n2
    do i1 = 1,n1
      data1%dat(i1,i2,i3) = i1 + (i2-1)*n1 + (i3-1)*n1*n2
    end do
  end do
end do
!$ACC UPDATE DEVICE(data1%dat)

call makedat(data2, n1, 1,n2, n3)

call get_indices(n2,1,n_chunks,m2c,n2c)
call makedat(data3, n1, m2c,n2c, n3)

!$ACC PARALLEL PRIVATE(m2c,n2c) DEFAULT(present) FIRSTPRIVATE(data3)

!print *, "#:",  __pgi_gangidx(), data3%n1, data3%m2, data3%n2

!$ACC LOOP GANG INDEPENDENT
do i_chunk=1,n_chunks
  ! --- Current chunk indices ---
  call get_indices(n2,i_chunk,n_chunks,m2c,n2c)
  call copydat(data1,data3,m2c,n2c,data3%m2)
  call computedat(data3,data3%m2,data3%n2)
  call copydat(data3,data2,data3%m2,data3%n2,m2c)
  ! TEST THAT WORKS
  !call copydat(data1,data2,m2c,n2c,m2c)
  !call computedat(data2,m2c,n2c)
end do
!$ACC END LOOP

!$ACC END PARALLEL

!$ACC UPDATE SELF(data2%dat)

print *, maxval(abs(data1%dat-data2%dat(1:n1,:,:)))

call deldat(data1)
call deldat(data2)
call deldat(data3)

end program main

If I use the calls

call copydat(data1,data2,m2c,n2c,m2c)
call computedat(data2,m2c,n2c)

i.e., each gang accesses its own block of a global shared array, it works as expected. However, I’d like to create a private data structure data3 and operate directly on it, with one data structure per gang. I tried the firstprivate clause, and every gang gets its own copy with the correct private indexes data3%n1, data3%m2, etc. However, if I try to access any value of data3%dat, I get a crash with

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Moreover, during compilation I noticed that the compiler says “CUDA shared memory used for data3”. Is it correct?

I am compiling the code with nvfortran -acc -gpu=cc60 -fast -Minfo=inline,accel -Minline=copydat test.F90 on a Tesla P100 GPU. In num.txt I have values such as 128, 128, 64 and 32 (one per line, since each value is read with its own read statement), but other values can be used as well.

Any help would be greatly appreciated.
Fabio

Hi Fabio,

While “data3” is getting privatized, it gets shallow copied, meaning that the host address held in the “dat” pointer gets copied to the device. Hence the program is accessing host addresses on the device.
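To make the shallow-copy point concrete, here is a small host-only sketch (no OpenACC involved). The type box_type and the program name are hypothetical stand-ins for test_type, not part of the original code. It only illustrates that copying a derived type copies the pointer association rather than the data it points to, which is the same kind of copy the firstprivate clause makes of “data3”.

```fortran
program shallow_copy_demo
  implicit none
  type box_type                      ! hypothetical stand-in for test_type
    integer :: n
    integer, pointer :: dat(:)
  end type box_type
  type(box_type) :: a, b

  a%n = 3
  allocate(a%dat(a%n))
  a%dat = 1

  ! Intrinsic assignment copies the scalar member by value but only
  ! pointer-assigns dat: b%dat now aliases a%dat (a shallow copy).
  b = a
  b%dat(1) = 99

  print *, 'a%dat(1) =', a%dat(1)    ! changed through the alias b%dat
  print *, 'same target:', associated(b%dat, a%dat)

  deallocate(a%dat)
  nullify(b%dat)
end program shallow_copy_demo
```

On the device the situation is worse than plain aliasing: the copied pointer still holds a host address, so dereferencing it in a kernel is what produces the illegal address error.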

You can use CUDA managed memory via the “-gpu=managed” flag, where allocated memory is implicitly managed and accessible on both the host and the device. However, this memory is not private, i.e. the “dat” pointers of all the private copies of “data3” will point to the same memory. That is not what you want.

If you can, you’ll want to use a private integer “dat” array rather than one inside the “data3” structure.
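A minimal sketch of that suggestion, assuming a fixed-size scratch array and made-up sizes (n1, n_chunks and the result array res are hypothetical and not taken from the code above):

```fortran
program private_array_sketch
  implicit none
  integer, parameter :: n1 = 8, n_chunks = 4   ! hypothetical sizes
  integer :: dat(n1)                 ! plain array, privatized per gang below
  integer :: res(n1, n_chunks)
  integer :: i1, i_chunk

  !$acc parallel loop gang private(dat) copyout(res)
  do i_chunk = 1, n_chunks
    !$acc loop vector
    do i1 = 1, n1
      dat(i1) = i1 + 15              ! work on the gang-private scratch array
    end do
    !$acc loop vector
    do i1 = 1, n1
      res(i1, i_chunk) = dat(i1) + i_chunk
    end do
  end do

  print *, res(1, 1), res(n1, n_chunks)
end program private_array_sketch
```

Because “dat” is a plain array and not a pointer member of a structure, the private clause can give each gang its own storage instead of copying an address.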

-Mat

Hi Mat,

Thank you very much for your reply. Unfortunately, I really need to use the data structure. I’m porting a large Fortran code (more than 50k lines) to GPU, and the whole logic of the code is built on data structures that contain the 3D grid, the different physical quantities (density, velocities, magnetic field, etc.) as members (either allocatable arrays or pointers, chosen at compile time), and the indexes on which to perform computations. We pass the data structure to many different routines, and each routine performs its computation on the members, operating on the indexes (each data member has extent (m1:n1,m2:n2,m3:n3), and we have nested loops over m1:n1, m2:n2, m3:n3). To change this logic I would have to change the interface of almost every routine in the code, which I would prefer to avoid.

Following your reply, I realized that in principle I can use the data3%dat pointers to point to different parts of a bigger array.

My solution is to add to the main program:

program main
...
integer, allocatable, dimension(:,:,:,:), target :: q
integer, pointer, dimension(:,:,:) :: p
...
call get_indices(n2,1,n_chunks,m2c,n2c)
allocate(q(n1,n2c-m2c+1,n3,n_chunks))
...
!$ACC PARALLEL FIRSTPRIVATE(m2c,n2c) DEFAULT(present) CREATE(q) PRIVATE(data3)
data3%n1 = n1; data3%n3 = n3
data3%m2 = 1; data3%n2 = n2c-m2c+1
!$ACC LOOP GANG INDEPENDENT PRIVATE(p)
do i_chunk=1,n_chunks
  !
  p => q(:,:,:,i_chunk)
  data3%dat => p
  ! --- Current chunk indices ---
  call get_indices(n2,i_chunk,n_chunks,m2c,n2c)
  call copydat(data1,data3,m2c,n2c,1)
  call computedat(data3,data3%m2,data3%n2)
  call copydat(data3,data2,data3%m2,data3%n2,m2c)
  nullify(data3%dat)
end do
!$ACC END LOOP
!$ACC END PARALLEL
!$ACC UPDATE SELF(data2%dat)

deallocate(q)
...
end program main

This seems to work as needed. The drawback is that I have to allocate an array q of dimension (:,:,:,n_chunks). If n_chunks > num_gangs (where num_gangs is set at run time), I’m allocating more memory than needed.
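One possible way to reduce that overhead, which was not discussed in this thread and is only a sketch under assumptions, is to size the scratch array by the number of gangs and let each gang walk several chunks sequentially. All names and sizes below (n_gangs, res, the 2D shape of q) are hypothetical simplifications of the real code.

```fortran
program per_gang_scratch
  implicit none
  integer, parameter :: n1 = 8, n_chunks = 16, n_gangs = 4  ! hypothetical sizes
  integer :: q(n1, n_gangs)          ! one scratch slice per gang, not per chunk
  integer :: res(n1, n_chunks)
  integer :: i1, i_gang, i_chunk

  !$acc parallel num_gangs(n_gangs) create(q) copyout(res)
  !$acc loop gang
  do i_gang = 1, n_gangs
    ! Each gang iteration handles chunks i_gang, i_gang+n_gangs, ...
    ! one after another and reuses slice i_gang of q for all of them.
    !$acc loop seq
    do i_chunk = i_gang, n_chunks, n_gangs
      !$acc loop vector
      do i1 = 1, n1
        q(i1, i_gang) = i1 + i_chunk
      end do
      !$acc loop vector
      do i1 = 1, n1
        res(i1, i_chunk) = q(i1, i_gang) + 15
      end do
    end do
  end do
  !$acc end parallel

  print *, res(1, 1), res(n1, n_chunks)
end program per_gang_scratch
```

In the code above this would correspond to pointing data3%dat at q(:,:,:,i_gang) instead of q(:,:,:,i_chunk). The trade-off is that the distribution of chunks over gangs is fixed by hand rather than left to the compiler.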

Moreover, I still have a question. I tried to use a smaller q (with 3 dimensions instead of 4 as above), allocated with allocate(q(n1,n2c-m2c+1,n3)), and to make it private on the loop. However, this doesn’t seem to create a private copy of q; it seems to be shared among the gangs. In other words, if I have the simplified example

program main
implicit none
integer         :: n_gangs, i_gangs
integer, allocatable, dimension(:), target :: q
integer, pointer, dimension(:) :: p

open(10,file='num.txt')
read(10,*) n_gangs
close(10)

allocate(q(1))

!$ACC PARALLEL LOOP GANG INDEPENDENT PRIVATE(p,q) &
!$ACC NUM_GANGS(n_gangs)
do i_gangs=1,n_gangs
  !
  p => q
  q(:) = i_gangs
  print *, i_gangs, q(1)
  nullify(p)
end do
!$ACC END PARALLEL LOOP
deallocate(q)
end program main

q is not private to each gang, and q(1) has the same value for all gangs. If I just comment out p => q, q is private and I get the expected result. But here I’m not even using p, why should it change the result if p is targeting q (in which case q is not private) or not (in which case q is private)? Could you explain such a behaviour?

Fabio

Yes, that should work. You’re basically manually privatizing “data3%dat”. As for memory usage, it would be similar to what the compiler would use if it could do the privatization itself; the only difference is that this is one copy per chunk versus one per gang. If num_gangs == n_chunks, the memory use would be the same either way.

Could you explain such a behaviour?

It appears to me that given “p” isn’t used, dead-code elimination is removing it. But this is somehow causing problems with “q” as well.

The workaround would be to use “p”, with something like “p(:) = i_gangs”, or to initialize it before assigning “q”, e.g. “p(:)=0”.
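Applied to the simplified example, the first variant of the workaround might look like the sketch below. The fixed n_gangs parameter replaces the read from num.txt only to keep the example self-contained, and whether this actually sidesteps the issue depends on the compiler version.

```fortran
program workaround_sketch
  implicit none
  integer, parameter :: n_gangs = 4   ! hypothetical; the original reads it from num.txt
  integer :: i_gangs
  integer, allocatable, dimension(:), target :: q
  integer, pointer, dimension(:) :: p

  allocate(q(1))

  !$acc parallel loop gang private(p,q) num_gangs(n_gangs)
  do i_gangs = 1, n_gangs
    p => q
    p(:) = i_gangs        ! write through p so the pointer is no longer dead code
    print *, i_gangs, q(1)
    nullify(p)
  end do

  deallocate(q)
end program workaround_sketch
```

The only change from the original loop body is that the assignment goes through “p” instead of “q”, so the pointer is actually used.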

I added a problem report, TPR #35494, and sent it to engineering for review.

-Mat