OpenACC Fortran Derived Types with Pointer Elements

When copying a derived type variable with pointer components to the device, some additional work is needed.
For example,

TYPE CSR_MATRIX
SEQUENCE
INTEGER :: entry_num, row_num, col_num
REAL(8), POINTER :: entry(:)
INTEGER, POINTER :: col_idx(:)
INTEGER, POINTER :: row_ptr(:)
END TYPE

TYPE(CSR_MATRIX) :: Matrix

! Pointer components are allocated appropriately here

!$ACC ENTER DATA COPYIN(Matrix, Matrix%entry, Matrix%col_idx, Matrix%row_ptr)

This will work.

Q1.
If ‘Matrix’ is an array of derived type, namely
TYPE(CSR_MATRIX) :: Matrix(100)

then,

!$ACC ENTER DATA COPYIN(Matrix)
DO i = 1, 100
!$ACC ENTER DATA COPYIN(Matrix(i)%entry, Matrix(i)%col_idx, Matrix(i)%row_ptr)
ENDDO

does this work?

Q2.
If I just allocate (create), instead of copyin, ‘Matrix’ on the device, namely,

!$ACC ENTER DATA CREATE(Matrix)

do I have to do the following?

!$ACC ENTER DATA CREATE(Matrix%entry, Matrix%col_idx, Matrix%row_ptr)

Hi CNJ,

For Q1, yes, this will work, but it’s a bit of a pain to manage. OpenACC doesn’t yet support deep copies/updates, so you need to manage it all yourself, and it’s easy to make mistakes. I wrote an example below.

Deep copy is being actively worked on, but it will be a while before it’s implemented. In the meantime, I would suggest starting with CUDA Unified Memory, which is enabled in PGI OpenACC via the flag “-ta=tesla:managed”.

It has several caveats, most notably that it only works for dynamically allocated data, that performance can be poor if you access the data back and forth between the host and the device, and that you’re limited to the amount of memory on your device. It’s also considered a Beta feature and is only available on Linux. But it does make dealing with these large, complicated data structures a lot easier.


For Q2, you can use either create or copyin. The difference being that copyin will create the data and then perform a shallow copy of the data. If you use just create, the data is uninitialized.
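
For a single (non-array) Matrix with pointer components, the create route from Q2 might look like the sketch below. It is not one of the tested examples in this thread: the program name q2_sketch and the size n are made up for illustration, and it assumes the members still need their own create so that they exist on the device.

```fortran
program q2_sketch
  implicit none

  type csr_matrix
    integer :: entry_num, row_num, col_num
    real(8), pointer :: entry(:)
    integer, pointer :: col_idx(:)
    integer, pointer :: row_ptr(:)
  end type

  integer, parameter :: n = 8     ! hypothetical size, for illustration only
  type(csr_matrix) :: Matrix
  integer :: i

  allocate(Matrix%entry(n), Matrix%col_idx(n), Matrix%row_ptr(n))
  Matrix%entry_num = n
  Matrix%row_num = n
  Matrix%col_num = n

  ! Parent first: reserves device space for the structure, contents uninitialized.
  !$acc enter data create(Matrix)
  ! create copies nothing, so push the scalar members explicitly.
  !$acc update device(Matrix%entry_num, Matrix%row_num, Matrix%col_num)
  ! Members next: reserves the device arrays (also uninitialized).
  !$acc enter data create(Matrix%entry, Matrix%col_idx, Matrix%row_ptr)

  ! Fill the arrays on the device, so nothing uninitialized is ever read.
  !$acc parallel loop present(Matrix)
  do i = 1, n
    Matrix%entry(i) = real(i, 8)
    Matrix%col_idx(i) = i
    Matrix%row_ptr(i) = i + 1
  end do

  !$acc update host(Matrix%entry(1:n), Matrix%col_idx(1:n), Matrix%row_ptr(1:n))
  print *, Matrix%entry(n), Matrix%col_idx(n), Matrix%row_ptr(n)

  ! Tear down in reverse order: members, then parent.
  !$acc exit data delete(Matrix%entry, Matrix%col_idx, Matrix%row_ptr)
  !$acc exit data delete(Matrix)
  deallocate(Matrix%entry, Matrix%col_idx, Matrix%row_ptr)
end program q2_sketch
```

Swapping the two create clauses for copyin would drop the need for the update device line, at the cost of transferring array contents that the kernel overwrites anyway.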

Note that copyin and the update directive both perform a shallow copy, so a copyin or update host of “Matrix” copies the pointer values stored in the structure (for copyin, the host pointers) rather than the data they point to. So be careful.

The “create” or “copyin” of the allocatable/pointer data members will also perform an “attach”, where the device address of the data member gets set in the device copy of the structure.
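
One way to stay clear of the shallow-copy pitfall is to avoid updating the parent structure as a whole once its members are on the device, and to refresh individual members instead. The sketch below illustrates that habit under stated assumptions: the trimmed-down type, the program name member_update_sketch, and the size n are invented for the example.

```fortran
program member_update_sketch
  implicit none

  type csr_matrix
    integer :: entry_num
    real(8), pointer :: entry(:)
  end type

  integer, parameter :: n = 4     ! hypothetical size, for illustration only
  type(csr_matrix) :: Matrix
  real(8) :: total
  integer :: i

  allocate(Matrix%entry(n))
  Matrix%entry_num = n
  Matrix%entry = 1.0d0

  ! Shallow copy of the parent, then the member copy that performs the attach.
  !$acc enter data copyin(Matrix)
  !$acc enter data copyin(Matrix%entry)

  ! The host data changes later on.
  Matrix%entry = 2.0d0

  ! Refresh only the array contents. An "update device(Matrix)" here would
  ! also resend the pointer stored in the structure, which is the shallow
  ! copy hazard described above.
  !$acc update device(Matrix%entry(1:n))

  total = 0.0d0
  !$acc parallel loop present(Matrix) reduction(+:total)
  do i = 1, n
    total = total + Matrix%entry(i)
  end do
  print *, total

  ! Members come off the device before the parent.
  !$acc exit data delete(Matrix%entry)
  !$acc exit data delete(Matrix)
  deallocate(Matrix%entry)
end program member_update_sketch
```

The same idea applies in the other direction: update host on the individual arrays, as in the full example below, rather than on “Matrix” itself.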

Hope this helps,
Mat

% cat test2.f90
program foo

 TYPE CSR_MATRIX
 SEQUENCE
 INTEGER :: entry_num, row_num, col_num
 REAL(8), ALLOCATABLE :: entry(:)
 INTEGER, ALLOCATABLE :: col_idx(:)
 INTEGER, ALLOCATABLE :: row_ptr(:)
 END TYPE

 integer :: i,j
 integer,parameter :: N = 100
 TYPE(CSR_MATRIX) :: Matrix(N)

!$acc enter data create(Matrix)
 do i=1,N
   allocate(Matrix(i)%entry(N))
   allocate(Matrix(i)%col_idx(N))
   allocate(Matrix(i)%row_ptr(N))
   Matrix(i)%entry_num = i
   Matrix(i)%row_num = i
   Matrix(i)%col_num = 1
!$ACC UPDATE device(Matrix(i)%entry_num,Matrix(i)%row_num,Matrix(i)%col_num)
!$ACC ENTER DATA COPYIN(Matrix(i)%entry, Matrix(i)%col_idx, Matrix(i)%row_ptr)
 end do

!$acc parallel loop present(Matrix)
 do j=1,N
 do i=1,N
   Matrix(j)%entry(i) = real(i+j) / real(N+N)
   Matrix(j)%col_idx(i) = Matrix(j)%col_num + i
   Matrix(j)%row_ptr(i) = Matrix(j)%row_num + i
 end do
 end do

#ifdef _OPENACC
 do i=1,N
!$ACC update host (Matrix(i)%entry(1:N), Matrix(i)%col_idx(1:N), Matrix(i)%row_ptr(1:N))
 end do
#endif
 print *, Matrix(21)%entry(99)
 print *, Matrix(15)%col_idx(3)
 print *, Matrix(67)%row_ptr(97)
 do i=1,N
!$ACC EXIT DATA delete(Matrix(i)%entry, Matrix(i)%col_idx, Matrix(i)%row_ptr)
    deallocate(Matrix(i)%entry)
    deallocate(Matrix(i)%col_idx)
    deallocate(Matrix(i)%row_ptr)
 enddo
!$acc exit data delete(Matrix)

end program foo

% pgf90 test2.f90 -Mpreprocess -acc -Minfo=accel -V15.10; a.out
foo:
     15, Generating enter data create(matrix(:))
     23, Generating update device(matrix%col_num,matrix%row_num,matrix%entry_num)
     24, Generating enter data copyin(matrix%row_ptr(:),matrix%col_idx(:),matrix%entry(:))
     27, Generating present(matrix(:))
         Accelerator kernel generated
         Generating Tesla code
         28, !$acc loop gang ! blockidx%x
         29, !$acc loop vector(128) ! threadidx%x
         Loop is parallelizable
     38, Generating update host(matrix%row_ptr(1:100),matrix%col_idx(1:100),matrix%entry(1:100))
     45, Generating exit data delete(matrix%row_ptr(:),matrix%col_idx(:),matrix%entry(:))
     50, Generating exit data delete(matrix(:))
   0.5999999642372131
            4
          164

Simplified to use CUDA Unified Memory:

% cat test2u.f90
program foo

 TYPE CSR_MATRIX
 SEQUENCE
 INTEGER :: entry_num, row_num, col_num
 REAL(8), ALLOCATABLE :: entry(:)
 INTEGER, ALLOCATABLE :: col_idx(:)
 INTEGER, ALLOCATABLE :: row_ptr(:)
 END TYPE

 integer :: i,j
 integer,parameter :: N = 100
 TYPE(CSR_MATRIX), allocatable, dimension(:) :: Matrix

 allocate(Matrix(N))
 do i=1,N
   allocate(Matrix(i)%entry(N))
   allocate(Matrix(i)%col_idx(N))
   allocate(Matrix(i)%row_ptr(N))
   Matrix(i)%entry_num = i
   Matrix(i)%row_num = i
   Matrix(i)%col_num = 1
 end do

!$acc parallel loop
 do j=1,N
 do i=1,N
   Matrix(j)%entry(i) = real(i+j) / real(N+N)
   Matrix(j)%col_idx(i) = Matrix(j)%col_num + i
   Matrix(j)%row_ptr(i) = Matrix(j)%row_num + i
 end do
 end do

 print *, Matrix(21)%entry(99)
 print *, Matrix(15)%col_idx(3)
 print *, Matrix(67)%row_ptr(97)
 do i=1,N
    deallocate(Matrix(i)%entry)
    deallocate(Matrix(i)%col_idx)
    deallocate(Matrix(i)%row_ptr)
 enddo
 deallocate(Matrix)

end program foo

% pgf90 test2u.f90 -Mpreprocess -acc -Minfo=accel -V15.10 -ta=tesla:managed ; a.out
foo:
     25, Accelerator kernel generated
         Generating Tesla code
         26, !$acc loop gang ! blockidx%x
         27, !$acc loop vector(128) ! threadidx%x
     25, Generating copy(matrix(:))
     27, Loop is parallelizable
   0.5999999642372131
            4
          164

When I try to compile and run the code without managed memory using PGI 16.1 it gives me a runtime error. In fact, this error appears to be the same one keeping me from moving forward with my own OpenACC development. I get the following error:

launch CUDA kernel  file=/home/ajacobs/Codebase/t1_oac_simp/noerror.f90 function=foo line=27 device=0 threadid=1 num_gangs=100 num_workers=1 vector_length=128 grid=100 block=128
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Any idea what has broken here?

Most likely you’re dereferencing a host pointer on the device. Check how, and in what order, you’re copying in the user-defined type. Manual deep copy can be a bit tricky.
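
One way to keep the order straight is to wrap it in a pair of helper routines, so that every array of matrices goes onto the device parent first and comes off members first. This is only a sketch: the module and routine names (csr_device_sketch, push_to_device, drop_from_device) are hypothetical, and it does not diagnose the specific failure reported above.

```fortran
module csr_device_sketch
  implicit none

  type csr_matrix
    integer :: entry_num, row_num, col_num
    real(8), allocatable :: entry(:)
    integer, allocatable :: col_idx(:)
    integer, allocatable :: row_ptr(:)
  end type

contains

  ! Hypothetical helper: parent array first, then each element's members,
  ! so that every member copyin can attach to an already present parent.
  subroutine push_to_device(m)
    type(csr_matrix), intent(inout) :: m(:)
    integer :: i
    !$acc enter data copyin(m)
    do i = 1, size(m)
      !$acc enter data copyin(m(i)%entry, m(i)%col_idx, m(i)%row_ptr)
    end do
  end subroutine push_to_device

  ! Hypothetical helper: the reverse order, members first, parent last.
  subroutine drop_from_device(m)
    type(csr_matrix), intent(inout) :: m(:)
    integer :: i
    do i = 1, size(m)
      !$acc exit data delete(m(i)%entry, m(i)%col_idx, m(i)%row_ptr)
    end do
    !$acc exit data delete(m)
  end subroutine drop_from_device

end module csr_device_sketch

program order_sketch
  use csr_device_sketch
  implicit none
  integer, parameter :: n = 10    ! hypothetical size, for illustration only
  type(csr_matrix) :: Matrix(n)
  integer :: i, j

  ! Allocate and initialize everything on the host before any copyin.
  do i = 1, n
    allocate(Matrix(i)%entry(n), Matrix(i)%col_idx(n), Matrix(i)%row_ptr(n))
    Matrix(i)%entry_num = n
    Matrix(i)%row_num = i
    Matrix(i)%col_num = 1
    Matrix(i)%entry = 0.0d0
    Matrix(i)%col_idx = 0
    Matrix(i)%row_ptr = 0
  end do

  call push_to_device(Matrix)

  !$acc parallel loop present(Matrix)
  do j = 1, n
    do i = 1, n
      Matrix(j)%entry(i) = real(i + j, 8)
    end do
  end do

  ! Bring results back member by member, never through the parent.
  do i = 1, n
    !$acc update host(Matrix(i)%entry(1:n))
  end do
  print *, Matrix(3)%entry(4)

  call drop_from_device(Matrix)
end program order_sketch
```

If the members were copied in before the parent, or the parent were copied in again afterwards, the device copy of the structure could end up holding host addresses, which is the kind of mistake that shows up as an illegal address inside the kernel.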

  • Mat