Hi CNJ,
For Q1, yes this will work, but it’s a bit of a pain to manage. OpenACC doesn’t yet support deep copies/updates, so you need to manage it all yourself, and it’s easy to make mistakes. I wrote an example below.
Deep copy is being actively worked on, but it will be a while before it’s implemented. What I would suggest in the meantime is to start with CUDA Unified Memory, which is enabled in PGI OpenACC via the flag “-ta=tesla:managed”.
It has several caveats, most notably that it only works for dynamic data, performance can be poor if you access the data back and forth between the host and device, and you’re limited to the amount of memory on your device. It’s also considered a Beta feature and is only available on Linux. But it does make dealing with these large, complicated data structures a lot easier.
For Q2, you can use either create or copyin. The difference is that copyin will create the data on the device and then perform a shallow copy of the host data into it. If you use just create, the device data is uninitialized.
Note that because copyin and the update directive perform a shallow copy, applying copyin or update to “Matrix” itself copies the pointer values stored in the structure (for example, the host pointers) rather than the data they point to. So be careful.
The “create” or “copyin” of the allocatable/pointer data members will also perform an “attach”, where the device address of the data member gets set in the device copy of the structure.
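To make the create/copyin difference from Q2 concrete, here is a minimal sketch on two plain arrays. The program name and the arrays a and b are made up for illustration and are not part of your code. Built without -acc, the directives are treated as comments and it simply runs on the host.

```fortran
program create_vs_copyin
  implicit none
  integer, parameter :: n = 4
  real(8) :: a(n), b(n)
  integer :: i

  a = 1.0d0
  b = 0.0d0

  ! copyin: allocates device memory for a and copies the host values into it
  !$acc enter data copyin(a)
  ! create: allocates device memory for b only; its device contents are uninitialized
  !$acc enter data create(b)

  !$acc parallel loop present(a, b)
  do i = 1, n
    ! a is read, so it needed copyin; b is only written, so create is enough
    b(i) = 2.0d0 * a(i)
  end do

  ! bring the device results back before using b on the host
  !$acc update host(b)
  print *, b

  !$acc exit data delete(a, b)
end program create_vs_copyin
```

If the kernel read b before writing it, create would not be enough and you would want copyin (or an update device) instead.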
Hope this helps,
Mat
% cat test2.f90
program foo
  TYPE CSR_MATRIX
    SEQUENCE
    INTEGER :: entry_num, row_num, col_num
    REAL(8), ALLOCATABLE :: entry(:)
    INTEGER, ALLOCATABLE :: col_idx(:)
    INTEGER, ALLOCATABLE :: row_ptr(:)
  END TYPE
  integer :: i,j
  integer,parameter :: N = 100
  TYPE(CSR_MATRIX) :: Matrix(N)
  !$acc enter data create(Matrix)
  do i=1,N
    allocate(Matrix(i)%entry(N))
    allocate(Matrix(i)%col_idx(N))
    allocate(Matrix(i)%row_ptr(N))
    Matrix(i)%entry_num = i
    Matrix(i)%row_num = i
    Matrix(i)%col_num = 1
    !$ACC UPDATE device(Matrix(i)%entry_num,Matrix(i)%row_num,Matrix(i)%col_num)
    !$ACC ENTER DATA COPYIN(Matrix(i)%entry, Matrix(i)%col_idx, Matrix(i)%row_ptr)
  end do
  !$acc parallel loop present(Matrix)
  do j=1,N
    do i=1,N
      Matrix(j)%entry(i) = real(i+j) / real(N+N)
      Matrix(j)%col_idx(i) = Matrix(j)%col_num + i
      Matrix(j)%row_ptr(i) = Matrix(j)%row_num + i
    end do
  end do
#ifdef _OPENACC
  do i=1,N
    !$ACC update host (Matrix(i)%entry(1:N), Matrix(i)%col_idx(1:N), Matrix(i)%row_ptr(1:N))
  end do
#endif
  print *, Matrix(21)%entry(99)
  print *, Matrix(15)%col_idx(3)
  print *, Matrix(67)%row_ptr(97)
  do i=1,N
    !$ACC EXIT DATA delete(Matrix(i)%entry, Matrix(i)%col_idx, Matrix(i)%row_ptr)
    deallocate(Matrix(i)%entry)
    deallocate(Matrix(i)%col_idx)
    deallocate(Matrix(i)%row_ptr)
  enddo
  !$acc exit data delete(Matrix)
end program foo
% pgf90 test2.f90 -Mpreprocess -acc -Minfo=accel -V15.10; a.out
foo:
15, Generating enter data create(matrix(:))
23, Generating update device(matrix%col_num,matrix%row_num,matrix%entry_num)
24, Generating enter data copyin(matrix%row_ptr(:),matrix%col_idx(:),matrix%entry(:))
27, Generating present(matrix(:))
Accelerator kernel generated
Generating Tesla code
28, !$acc loop gang ! blockidx%x
29, !$acc loop vector(128) ! threadidx%x
Loop is parallelizable
38, Generating update host(matrix%row_ptr(1:100),matrix%col_idx(1:100),matrix%entry(1:100))
45, Generating exit data delete(matrix%row_ptr(:),matrix%col_idx(:),matrix%entry(:))
50, Generating exit data delete(matrix(:))
0.5999999642372131
4
164
Simplified to use CUDA Unified Memory:
% cat test2u.f90
program foo
  TYPE CSR_MATRIX
    SEQUENCE
    INTEGER :: entry_num, row_num, col_num
    REAL(8), ALLOCATABLE :: entry(:)
    INTEGER, ALLOCATABLE :: col_idx(:)
    INTEGER, ALLOCATABLE :: row_ptr(:)
  END TYPE
  integer :: i,j
  integer,parameter :: N = 100
  TYPE(CSR_MATRIX), allocatable, dimension(:) :: Matrix
  allocate(Matrix(N))
  do i=1,N
    allocate(Matrix(i)%entry(N))
    allocate(Matrix(i)%col_idx(N))
    allocate(Matrix(i)%row_ptr(N))
    Matrix(i)%entry_num = i
    Matrix(i)%row_num = i
    Matrix(i)%col_num = 1
  end do
  !$acc parallel loop
  do j=1,N
    do i=1,N
      Matrix(j)%entry(i) = real(i+j) / real(N+N)
      Matrix(j)%col_idx(i) = Matrix(j)%col_num + i
      Matrix(j)%row_ptr(i) = Matrix(j)%row_num + i
    end do
  end do
  print *, Matrix(21)%entry(99)
  print *, Matrix(15)%col_idx(3)
  print *, Matrix(67)%row_ptr(97)
  do i=1,N
    deallocate(Matrix(i)%entry)
    deallocate(Matrix(i)%col_idx)
    deallocate(Matrix(i)%row_ptr)
  enddo
  deallocate(Matrix)
end program foo
% pgf90 test2u.f90 -Mpreprocess -acc -Minfo=accel -V15.10 -ta=tesla:managed ; a.out
foo:
25, Accelerator kernel generated
Generating Tesla code
26, !$acc loop gang ! blockidx%x
27, !$acc loop vector(128) ! threadidx%x
25, Generating copy(matrix(:))
27, Loop is parallelizable
0.5999999642372131
4
164