Hello,
In the following code I am having a problem with printing bst_nr%b(1)%minf, while bst_nr%n is printed. What am I doing wrong?
Thanks,
Ilkhom
[ilkhom@t020 MCCC-FN-GPU_DEV]$ module load nvhpc/22.3
[ilkhom@t020 MCCC-FN-GPU_DEV]$ cat test.f90
module bst_data
  implicit none
  type:: nr
    integer :: minf
  end type nr
  type:: basis
    integer:: n
    type(nr), allocatable, dimension(:) :: b
  end type basis
  type(basis) :: bst_nr ! this is Sturmian basis (depend on k and l)
  !$acc declare create(bst_nr)
end module bst_data
subroutine acc_rout
  !$acc routine seq
  use bst_data
  implicit none
  print*,'bst_nr%n=', bst_nr%n
  print*,'bst_nr%b(1)%minf=', bst_nr%b(1)%minf
end subroutine acc_rout
program test
  use bst_data
  implicit none
  integer :: i
  bst_nr%n=5
  allocate(bst_nr%b(1:bst_nr%n))
  do i=1,bst_nr%n
    bst_nr%b(i)%minf=2
  enddo
  !$acc enter data copyin(bst_nr%n)
  !$acc enter data copyin(bst_nr%b)
  do i=1,bst_nr%n
    !$acc enter data copyin(bst_nr%b(i)%minf)
  enddo
  !$acc update device(bst_nr)
  !$acc kernels
  !$acc loop
  do i=1,1
    call acc_rout
  enddo
  !$acc end kernels
end program test
[ilkhom@t020 MCCC-FN-GPU_DEV]$ nvfortran -Minfo=accel -acc -ta=tesla:cuda11.6 test.f90 && ./a.out
acc_rout:
14, Generating acc routine seq
Generating NVIDIA GPU code
test:
35, Generating enter data copyin(bst_nr%n)
36, Generating enter data copyin(bst_nr%b(:))
38, Generating enter data copyin(bst_nr%b$p%minf)
40, Generating update device(bst_nr)
44, Loop is parallelizable
Generating NVIDIA GPU code
44, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
bst_nr%n= 5
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
The “update device(bst_nr)” directive is the problem. It performs a shallow copy of the UDT (user-defined type), so you’re overwriting the device address of “b” with the host address.
!$acc enter data copyin(bst_nr%n)
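This should be an update directive. “copyin” creates and copies the data only if it’s not already present. But “bst_nr” is already present on the device because of the “declare create”, so no action is taken and, once the shallow update is removed, “n” will remain uninitialized.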
Here’s the corrected code:
% cat test.F90
module bst_data
  implicit none
  type:: nr
    integer :: minf
  end type nr
  type:: basis
    integer:: n
    type(nr), allocatable, dimension(:) :: b
  end type basis
  type(basis) :: bst_nr ! this is Sturmian basis (depend on k and l)
  !$acc declare create(bst_nr)
end module bst_data
subroutine acc_rout
  !$acc routine seq
  use bst_data
  implicit none
  print*,'bst_nr%n=', bst_nr%n
  print*,'bst_nr%b(1)%minf=', bst_nr%b(1)%minf
end subroutine acc_rout
program test
  use bst_data
  implicit none
  integer :: i
  bst_nr%n=5
  allocate(bst_nr%b(1:bst_nr%n))
  do i=1,bst_nr%n
    bst_nr%b(i)%minf=2
  enddo
  !$acc update device(bst_nr%n)
  !$acc enter data copyin(bst_nr%b)
  do i=1,bst_nr%n
    !$acc enter data copyin(bst_nr%b(i)%minf)
  enddo
  !$acc kernels
  !$acc loop
  do i=1,1
    call acc_rout
  enddo
  !$acc end kernels
end program test
% nvfortran test.F90 -acc ; a.out
bst_nr%n= 5
bst_nr%b(1)%minf= 2
Thanks, Mat, for the prompt reply (as always). Much appreciated! I also noticed that just printing bst_nr%b(i)%f(1:nr) from within the acc routine seq subroutine is very slow. Is this expected behaviour with derived type structures? In my case i~100 and nr~8000. f(:) is a double precision array.
I’m a bit surprised that putting an array in a print statement worked. There’s only basic support for printing from the device, and I didn’t think the runtime call for printing arrays was supported. I just tried it and see the expected error: “Compiler runtime function not supported - pghpfio_ldw64”
In general, codes should limit printing from the device. All the output needs to get buffered and copied back to the host to be printed. So yes, I’d expect printing out 800,000 doubles from the device to be slow.
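To illustrate the point about limiting device-side printing, here is a minimal sketch (not compiled or timed here) that computes on the device, copies each “f” array back with “update self”, and prints from the host. The program name, the small sizes, and the fill formula are placeholders of mine, not part of the original code; the type layout follows the one used in this thread.

```fortran
module bst_data
  implicit none
  type :: nr
    integer :: minf
    real(8), allocatable, dimension(:) :: f
  end type nr
  type :: basis
    integer :: n
    integer :: nr
    type(nr), allocatable, dimension(:) :: b
  end type basis
  type(basis) :: bst_nr
  !$acc declare create(bst_nr)
end module bst_data

program print_on_host   ! hypothetical name, for illustration only
  use bst_data
  implicit none
  integer :: i, j

  bst_nr%n  = 4          ! illustrative sizes, much smaller than 100 x 8000
  bst_nr%nr = 8
  allocate(bst_nr%b(1:bst_nr%n))
  do i = 1, bst_nr%n
    allocate(bst_nr%b(i)%f(1:bst_nr%nr))
    bst_nr%b(i)%f = 0.0d0
  enddo

  ! bst_nr already exists on the device (declare create), so the scalar
  ! members are updated rather than copied in.
  !$acc update device(bst_nr%n, bst_nr%nr)
  !$acc enter data copyin(bst_nr%b)
  do i = 1, bst_nr%n
    !$acc enter data copyin(bst_nr%b(i)%f(:bst_nr%nr))
  enddo

  ! Fill f on the device; the formula is an arbitrary placeholder.
  !$acc parallel loop gang
  do i = 1, bst_nr%n
    !$acc loop vector
    do j = 1, bst_nr%nr
      bst_nr%b(i)%f(j) = real(i, 8) + 0.001d0 * real(j, 8)
    enddo
  enddo

  ! Copy the results back one f array at a time, then print on the host.
  do i = 1, bst_nr%n
    !$acc update self(bst_nr%b(i)%f(:bst_nr%nr))
  enddo
  do i = 1, bst_nr%n
    print *, 'bst_nr%b(', i, ')%f(1)=', bst_nr%b(i)%f(1)
  enddo
end program print_on_host
```

Note that only the “f” members are updated here. An “update self(bst_nr)” on the whole structure would again be a shallow copy, this time overwriting the host address of “b” with the device one.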
Another issue, though, is that the access of “f” is non-contiguous across the threads (each thread is accessing a separate array). This will lead to a lot of stalls waiting for memory and to cache thrashing. If you put the print in an explicit loop and then make the routine “vector”, you can get the threads to access the array contiguously. Something like:
module bst_data
  implicit none
  type:: nr
    integer :: minf
    real(8), allocatable, dimension(:) :: f
  end type nr
  type:: basis
    integer:: n
    integer :: nr
    type(nr), allocatable, dimension(:) :: b
  end type basis
  type(basis) :: bst_nr ! this is Sturmian basis (depend on k and l)
  !$acc declare create(bst_nr)
end module bst_data
module routines
contains
  subroutine acc_rout(i)
    ! vector routine: the j loop below is spread across the vector threads,
    ! so neighbouring threads read neighbouring elements of the same f array
    !$acc routine vector
    use bst_data
    implicit none
    integer,value :: i
    integer :: nrs,j
    nrs= bst_nr%nr
    !$acc loop vector
    do j=1,nrs
      print*,'bst_nr%b(i)%f(j)=', bst_nr%b(i)%f(j)
    enddo
  end subroutine acc_rout
end module routines
program test
  use bst_data
  use routines
  implicit none
  integer :: i
  bst_nr%n=100
  bst_nr%nr = 8000
  allocate(bst_nr%b(1:bst_nr%n))
  do i=1,bst_nr%n
    bst_nr%b(i)%minf=2
    allocate(bst_nr%b(i)%f(1:bst_nr%nr))
    bst_nr%b(i)%f = 1.0
  enddo
  ! bst_nr is already on the device (declare create): update the scalar members
  !$acc update device(bst_nr%n)
  !$acc update device(bst_nr%nr)
  ! copy in the array of nr elements, then each element's f array
  !$acc enter data copyin(bst_nr%b)
  do i=1,bst_nr%n
    !$acc enter data copyin(bst_nr%b(i)%f(:bst_nr%nr))
  enddo
  !$acc kernels
  !$acc loop gang
  do i=1,bst_nr%n
    call acc_rout(i)
  enddo
  !$acc end kernels
end program test
Thank you very much for advising an optimal way to work with derived-type arrays. With the changes you suggested I got a speedup of about 200 times in the main code.