Performance of private arrays declared in a separate module

I’m trying to investigate some performance problems with an OpenMP offload kernel in Fortran that I’ve been working on, and I’ve noticed that in certain situations private arrays can kill performance. For instance, this kernel does nothing much but has private arrays:

MODULE DECLARATION
 
IMPLICIT NONE
 
 
double precision,DIMENSION(:),ALLOCATABLE::IMPRIVATE_A
double precision,DIMENSION(:),ALLOCATABLE::IMPRIVATE_B
double precision,DIMENSION(:),ALLOCATABLE::IMPRIVATE_C
 
END MODULE DECLARATION
 
 
 
program prototype
 
    use omp_lib
    use DECLARATION
 
    implicit none
 
    INTEGER :: I,J,N, NK
 
    double precision :: tbeg, tend
 
    character (len=10) :: input_NK
    integer :: istat
 
 
 
    call get_command_argument(1, input_NK)
    read (input_NK, '(i9)', iostat=istat) NK
 
 
    ALLOCATE(IMPRIVATE_A(NK))
    ALLOCATE(IMPRIVATE_B(NK))
    ALLOCATE(IMPRIVATE_C(NK))
 
    N = 1e6
 
    J = 0
    tbeg = omp_get_wtime()
    !$OMP target teams distribute parallel do private(IMPRIVATE_A,IMPRIVATE_B,IMPRIVATE_C,I,J)
    DO I=1,N
        J = I
    END DO
    !$OMP end target teams distribute parallel do
    tend = omp_get_wtime()
 
    write(*, *) "NK: ", NK
    write (*, '(a35, es15.7)') "TIME (s): ", tend - tbeg
 
 
END program prototype

If I pass in 200 as the number of array elements the time explodes, but at, say, 400 it drops back down.

Also, the slowdown from adding more private arrays grows worse than linearly.

If I take the original arrays out of the module, the sensitivity to array size goes away.
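Roughly, I mean a variant like this (assuming “out of the module” just means declaring the same allocatables locally in the main program; this is a sketch rather than my exact test code):

program prototype_local

    use omp_lib

    implicit none

    INTEGER :: I,J,N, NK

    double precision,DIMENSION(:),ALLOCATABLE::IMPRIVATE_A
    double precision,DIMENSION(:),ALLOCATABLE::IMPRIVATE_B
    double precision,DIMENSION(:),ALLOCATABLE::IMPRIVATE_C

    double precision :: tbeg, tend

    NK = 200          ! array size hard-coded just for the sketch
    N = 1000000

    ALLOCATE(IMPRIVATE_A(NK))
    ALLOCATE(IMPRIVATE_B(NK))
    ALLOCATE(IMPRIVATE_C(NK))

    J = 0
    tbeg = omp_get_wtime()
    !$OMP target teams distribute parallel do private(IMPRIVATE_A,IMPRIVATE_B,IMPRIVATE_C,I,J)
    DO I=1,N
        J = I
    END DO
    !$OMP end target teams distribute parallel do
    tend = omp_get_wtime()

    write (*, '(a35, es15.7)') "TIME (s): ", tend - tbeg

END program prototype_local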

Any idea what is going on here?

This is an interesting observation. Do you get the same behavior if you use “omp target loop” instead of “target teams distribute”?

Sorry, I’m away on a business trip this week so I didn’t get a chance to look at this in depth, but I’m wondering if the larger case is hitting a heap overflow?

The private arrays should be getting allocated on the device per thread, and three 400-element arrays across all of those threads will take a lot of memory. The device-side allocation might be failing but not showing up as an error since the arrays aren’t being accessed. Hence it seems like it’s faster, but it is really an error.
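As a rough back-of-the-envelope (my own numbers, not measurements from this thread):

    ! double precision = 8 bytes per element, so per thread:
    !   3 arrays * 400 elements * 8 bytes = 9600 bytes of private storage
    ! Multiplied by however many device threads the runtime launches for a
    ! 1e6-iteration loop, the total device-heap demand can get large quickly.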

Try looping and assigning values to the private arrays to see what happens.

If I add writes to those arrays I don’t see any obvious errors:

    !$OMP target teams distribute parallel do default(none) private(IMPRIVATE_A,IMPRIVATE_B,IMPRIVATE_C,I,J) shared(N,NK)
    DO I=1,N
        IMPRIVATE_A(1:NK) = I
        IMPRIVATE_B(1:NK) = I
        IMPRIVATE_C(1:NK) = I
    END DO
    !$OMP end target teams distribute parallel do

I am also setting NV_ACC_CUDA_HEAPSIZE to, say, 5 GB, to rule out a heap limit.

Using “omp target teams loop” does make the behaviour disappear.
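For reference, I mean something like this: the same write loop with just the construct swapped (a sketch, not my exact timing harness):

    !$OMP target teams loop default(none) private(IMPRIVATE_A,IMPRIVATE_B,IMPRIVATE_C,I,J) shared(N,NK)
    DO I=1,N
        IMPRIVATE_A(1:NK) = I
        IMPRIVATE_B(1:NK) = I
        IMPRIVATE_C(1:NK) = I
    END DO
    !$OMP end target teams loop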

Also, oddly, adding in the writes made the “parallel do” version faster, but it is still slow.

So what is the trade-off here?

“target teams distribute parallel do” and “target teams loop” come from different iterations of the OpenMP spec. The first is more explicit, whereas the second leaves the mapping of the parallelism more up to the implementation. Are there more specific differences in the case of nvfortran?

So the trade-off is explicit control versus performance, I suppose. We often recommend that users, if they can, use “target teams loop” rather than “target teams distribute”, because the performance with our compiler is usually better with “loop”. We are working on fixing the performance issues we find with “teams distribute” constructs, though, so your case is interesting: we are going to look at why that behavior is happening and will likely report it as a performance bug to get it fixed, if we can.

Hope that helps!

Thanks. I will try using “target teams loop” generally then.

However, I recall now why I wasn’t using “target teams loop” in the actual code. It is because I was getting a crash at compile time. I’ve narrowed the crash down to passing global variables to subroutines from inside OpenMP kernels (using “target teams loop”).

And I’ve reduced the problem down to this:

MODULE DECLARATION

IMPLICIT NONE

INTEGER::IMAX

CONTAINS

    SUBROUTINE DOSOMETHING(IM)
    !$OMP declare target
        IMPLICIT NONE

        INTEGER,INTENT(INOUT)::IM

    END SUBROUTINE DOSOMETHING

END MODULE DECLARATION


program prototype

    use DECLARATION

    implicit none

    INTEGER :: I, N

    N = 1e6

    !$OMP target teams loop default(none) shared(N) private(IMAX)
    DO I=1,N
        CALL DOSOMETHING(IMAX)
    END DO
    !$OMP end target teams loop

END program prototype

This causes the compiler to crash, but it doesn’t crash when using “target teams distribute parallel do”:

$ nvfortran -Minfo -mp=gpu omp_loop_sbcall.f90 -o omp_loop_sbcall.x
dosomething:
      9, Generating device code for declare target routine
         Generating NVIDIA GPU code
nvfortran-Fatal-nvhpc-251/Linux_x86_64/25.1/compilers/bin/tools/fort2 TERMINATED by signal 11


$ nvfortran -Minfo -mp=gpu parallel_do_sbcall.f90 -o parallel_do_sbcall.x
dosomething:
      9, Generating device code for declare target routine
         Generating NVIDIA GPU code
prototype:
     30, !$omp target teams distribute parallel do
         30, Generating "nvkernel_MAIN__F1L30_2" GPU kernel

I found a workaround, which is to enclose the kernel in a subroutine, pass the global variables in as arguments, and use the passed-in references inside the kernel:

MODULE DECLARATION

IMPLICIT NONE

INTEGER::IMAX

CONTAINS

    SUBROUTINE DOSOMETHING(IM)
    !$OMP declare target
        IMPLICIT NONE

        INTEGER,INTENT(INOUT)::IM

    END SUBROUTINE DOSOMETHING


    SUBROUTINE EXTRAWRAPPER(IMAX_L)
        IMPLICIT NONE

        INTEGER,INTENT(INOUT)::IMAX_L
        INTEGER :: I, N

        N = 1e6
    
        !$OMP target teams loop default(none) shared(N) private(IMAX_L) 
        DO I=1,N
            CALL DOSOMETHING(IMAX_L)
        END DO
        !$OMP end target teams loop
    
    END SUBROUTINE EXTRAWRAPPER

END MODULE DECLARATION


program prototype

    use DECLARATION

    implicit none

    CALL EXTRAWRAPPER(IMAX)

END program prototype
$ nvfortran -Minfo -mp=gpu omp_loop_sbcall_workaround.f90 -o omp_loop_sbcall_workaround.x
dosomething:
      9, Generating device code for declare target routine
         Generating NVIDIA GPU code
extrawrapper:
     28, !$omp target teams loop
         28, Generating "nvkernel_declaration_extrawrapper__F1L28_2" GPU kernel
             Generating NVIDIA GPU code
           29, Loop parallelized across teams ! blockidx%x
         28, Generating Multicore code
           29, Loop parallelized across threads
$

That’s very interesting behavior! I have reported the “target teams loop” issue to our engineering team as TPR#37128; hopefully we can get that fixed. We are always appreciative of people bringing us interesting bugs and unhelpful compiler crash behaviors, so thank you for sharing that!

We will also continue investigating the target teams distribute behavior you’ve highlighted. If we come to any interesting findings, we’ll be sure to let you know!

Thanks again!

