I’m trying to investigate some performance problems with an OpenMP offload kernel in Fortran that I’ve been working on, and I’ve noticed that in certain situations private arrays can kill performance. For instance, this kernel does nothing much but has private arrays:
MODULE DECLARATION
   IMPLICIT NONE
   ! Module-level allocatables that get privatized in the target region
   double precision, DIMENSION(:), ALLOCATABLE :: IMPRIVATE_A
   double precision, DIMENSION(:), ALLOCATABLE :: IMPRIVATE_B
   double precision, DIMENSION(:), ALLOCATABLE :: IMPRIVATE_C
END MODULE DECLARATION

program prototype
   use omp_lib
   use DECLARATION
   implicit none
   INTEGER :: I, J, N, NK
   double precision :: tbeg, tend
   character (len=10) :: input_NK
   integer :: istat

   ! NK (the private array size) comes from the command line
   call get_command_argument(1, input_NK)
   read (input_NK, '(i9)', iostat=istat) NK

   ALLOCATE(IMPRIVATE_A(NK))
   ALLOCATE(IMPRIVATE_B(NK))
   ALLOCATE(IMPRIVATE_C(NK))

   N = 1000000
   J = 0

   tbeg = omp_get_wtime()
   ! The arrays are never touched inside the loop; they are only privatized
   !$OMP target teams distribute parallel do private(IMPRIVATE_A,IMPRIVATE_B,IMPRIVATE_C,I,J)
   DO I = 1, N
      J = I
   END DO
   !$OMP end target teams distribute parallel do
   tend = omp_get_wtime()

   write(*, *) "NK: ", NK
   write (*, '(a35, es15.7)') "TIME (s): ", tend - tbeg
END program prototype
If I pass in 200 as the number of array elements the time explodes, but at, say, 400 it drops back down.
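For reference, this is roughly how I build and run it (the compiler and flags here are just what I happen to use locally; nvfortran with -mp=gpu targets the GPU, so adjust for your own toolchain):

nvfortran -mp=gpu prototype.f90 -o prototype
./prototype 200   # slow case
./prototype 400   # fast again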
Also, the slowdown from adding more private arrays scales worse than linearly with the number of arrays.
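To be concrete, by "more private arrays" I mean extending the module and the clause with further allocatables of the same size, e.g. (IMPRIVATE_D and IMPRIVATE_E here are just hypothetical additions):

!$OMP target teams distribute parallel do private(IMPRIVATE_A,IMPRIVATE_B,IMPRIVATE_C,IMPRIVATE_D,IMPRIVATE_E,I,J)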
If I take the original arrays out of the module, the sensitivity to array size goes away:
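Here is a sketch of what I mean by that variant (I'm assuming the arrays simply become local allocatables in the main program, privatized exactly as before):

program prototype_local
   use omp_lib
   implicit none
   INTEGER :: I, J, N, NK
   double precision :: tbeg, tend
   character (len=10) :: input_NK
   integer :: istat
   ! Same arrays as before, but now program locals rather than module variables
   double precision, DIMENSION(:), ALLOCATABLE :: IMPRIVATE_A
   double precision, DIMENSION(:), ALLOCATABLE :: IMPRIVATE_B
   double precision, DIMENSION(:), ALLOCATABLE :: IMPRIVATE_C

   call get_command_argument(1, input_NK)
   read (input_NK, '(i9)', iostat=istat) NK

   ALLOCATE(IMPRIVATE_A(NK))
   ALLOCATE(IMPRIVATE_B(NK))
   ALLOCATE(IMPRIVATE_C(NK))

   N = 1000000
   J = 0

   tbeg = omp_get_wtime()
   !$OMP target teams distribute parallel do private(IMPRIVATE_A,IMPRIVATE_B,IMPRIVATE_C,I,J)
   DO I = 1, N
      J = I
   END DO
   !$OMP end target teams distribute parallel do
   tend = omp_get_wtime()

   write(*, *) "NK: ", NK
   write (*, '(a35, es15.7)') "TIME (s): ", tend - tbeg
END program prototype_local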
Any idea what is going on here?