Out-of-range error with OpenMP GPU offload

Hello,

I’ve been having trouble with an array reduction in a loop that I want to GPU-offload with OpenMP. I’m attaching the code.
I’m using nvfortran from nvhpc/22.7 on an A100 GPU with unified memory. My compilation line is: nvfortran -pg -mp=gpu -gpu=cc80,managed -Minfo=mp test.f90 -o test_omp

The error I’m seeing with cuda-memcheck is something like:
========= CUDA-MEMCHECK
========= Out-of-range Shared or Local Address
========= at 0x000000c0 in __cuda_syscall_mc_dyn_globallock_check
========= by thread (0,0,0) in block (1,0,0)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so [0x2ea4a1]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/lib/libnvomp.so [0x42f19]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/lib/libnvomp.so [0x3f021]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/lib/libnvomp.so [0x3ed23]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/lib/libnvomp.so [0x224bc]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/lib/libnvomp.so [0x1e8e9]
========= Host Frame:/opt/nvidia/hpc_sdk/Linux_x86_64/22.7/compilers/lib/libnvomp.so (__nvomp_target + 0xf9) [0x1dbe9]

Any help is appreciated
Thanks,
Oscar

test.f90 (1.8 KB)


Hi Oscar,

Not unexpected here. In order to perform the reduction, each thread gets its own private copy of the entire array. Since “c” is quite large, the code runs out of memory. It’s better to use atomics in these cases.

Though given that each value of “kk” is unique, there are no collisions, so you don’t need either a reduction or an atomic. Just remove “reduction(+:c)”.
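
For illustration only (the attached test.f90 isn’t quoted here, so the index expression into “c” below is hypothetical), the loop without the reduction would look something like:

!$omp target teams distribute parallel do collapse(2) private(kk,k)
do j=1,n
   do i=1,n
      kk = (j-1)*n+i
      do k=1,mm
         ! each (kk,k) pair is unique, so every element of "c" is written by
         ! exactly one thread and no reduction or atomic is needed
         c((kk-1)*mm+k) = c((kk-1)*mm+k) + a(kk) * b(kk)
      enddo
   enddo
enddo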

-Mat

Hi Mat,
Thanks for your help. Got it, yes, you are right! I extracted this piece of code from a bigger project. The piece of code where a reduction is actually needed will be something like:
!$omp target teams distribute parallel do collapse(2) reduction(+:c)
do j=1,n
   do i=1,n
      kk = (j-1)*n+i
      do k=1,mm
         c(k) = c(k) + a(kk) * b(kk)
      enddo
   enddo
enddo

So atomics will be the only option, right?
Thanks!

In this case the size of “c” is only 10 elements, correct? If so, a reduction should be fine since it’s small. An atomic here would most likely be bad since there would be a significant number of collisions.

In the first case “c” is (513x513)x10, significantly larger and therefore problematic when each thread needs to create its own copy to hold the partial reduction.
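
For completeness, a rough sketch of the atomic variant of your second loop would look like the following (every thread now contends on the same ten elements of “c”, which is where the collisions come from):

!$omp target teams distribute parallel do collapse(2) private(kk,k)
do j=1,n
   do i=1,n
      kk = (j-1)*n+i
      do k=1,mm
         !$omp atomic update
         c(k) = c(k) + a(kk) * b(kk)   ! every thread updates the same c(k), so protect it
      enddo
   enddo
enddo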

Yes, “c” is only 10 elements, but using the reduction gives me the same out-of-range error. With an atomic there is no problem.
Is this expected, or could it be something else?
Thanks Mat!

Attaching the test code.
test.f90 (1.7 KB)

It’s actually failing for a different reason: “n”.

Adding “n” to a firstprivate clause, along with privatizing “kk” and “k”, works around the issue.

Alternatively, you can use the “loop” directive or OpenACC, which allow for better analysis during compilation; there the compiler is most likely able to auto-detect that “n” should be firstprivate.

% cat test.F90
module var_rtable
   real*8, dimension(:), allocatable :: a, b
   real*8, dimension(:), allocatable :: c
   integer :: n
end module var_rtable

program matrix_multiply
   use omp_lib
   use var_rtable
   implicit none
   integer :: i, j, k, kk, myid, m, mm, compiled_for, option
   integer, parameter :: fd = 11
   integer :: t1, t2, dt, count_rate, count_max
   real :: tmp, secs, acumm

   open(fd,file='wallclocktime',form='formatted')

   option = compiled_for(fd) ! OpenMP

   call system_clock(count_max=count_max, count_rate=count_rate)

   call system_clock(t1)
   mm = 10
   m = 1
   n = 513
!   n = 350
   acumm = 0
   allocate( a(n*n), b(n*n), c(mm) )

! Initialize matrices
   do j=1,n
      do i=1,n
         kk = (j-1)*n+i
         a(kk) = 1.0
         b(kk) = 2.0
         do k=1,mm
           c(k) = 0.0
         enddo
      enddo
   enddo

   call test_sub_red_omp(mm)

   print*,"Answer should be 526338 and the test gives ",c(1)
   call system_clock(t2)
   dt = t2-t1
   secs = real(dt)/real(count_rate)
   write(fd,"('For n=',i4,', wall clock time is ',f12.10,' seconds')") &
           n, secs

   deallocate(a, b, c)
   close(fd)

contains

  subroutine test_sub_red_omp(mm)
    use var_rtable
    integer, intent(in) ::  mm
    integer :: i, j, k, l, kk


#ifdef _OPENACC
      !$acc parallel loop gang vector collapse(2) reduction(+:c)
#elif defined(USE_LOOP)
      !$omp target teams loop collapse(2) reduction(+:c) bind(teams,parallel)
#else
      !$omp target teams distribute parallel do collapse(2) reduction(+:c) private(kk,k) firstprivate(n)
#endif
      do j=1,n
         do i=1,n
            kk = (j-1)*n+i
            do k=1,mm
              c(k) = c(k) + a(kk) * b(kk)
            enddo
         enddo
      enddo

  end subroutine test_sub_red_omp

end program matrix_multiply

integer function compiled_for(fd)
implicit none
integer :: fd
  compiled_for = 3
  write(fd,"('This code is compiled with OpenMP')")

end function compiled_for


% nvfortran -acc=gpu -Minfo=accel test.F90 -gpu=managed ; a.out
test_sub_red_omp:
     63, Generating copy(c(:)) [if not already present]
         Generating implicit copyin(b(:)) [if not already present]
         Generating NVIDIA GPU code
         69, !$acc loop gang, vector(128) collapse(2) ! blockidx%x threadidx%x
             Generating reduction(+:c(:))
         70,   ! blockidx%x threadidx%x collapsed
         72, !$acc loop seq
     63, Generating implicit copyin(a(:)) [if not already present]
     70, Generating implicit private(kk)
     72, Loop is parallelizable
 Answer should be 526338 and the test gives     526338.0000000000
% nvfortran -mp=gpu -Minfo=mp test.F90 -gpu=managed -DUSE_LOOP ; a.out
test_sub_red_omp:
     65, !$omp target teams loop
         65, Generating "nvkernel_matrix_multiply_test_sub_red_omp_F1L65_2" GPU kernel
             Generating NVIDIA GPU code
           69, Loop parallelized across teams, threads(128) collapse(2) ! blockidx%x threadidx%x
           70,   ! blockidx%x threadidx%x collapsed
           72, Loop run sequentially
           69, Generating reduction(+:c(:))
         65, Generating Multicore code
           69, Loop parallelized across threads
     70, Generating implicit private(kk)
     72, Loop is parallelizable
 Answer should be 526338 and the test gives     526338.0000000000
% nvfortran -mp=gpu -Minfo=mp test.F90 -gpu=managed  ; a.out
test_sub_red_omp:
     67, !$omp target teams distribute parallel do
         67, Generating "nvkernel_matrix_multiply_test_sub_red_omp_F1L67_2" GPU kernel
 Answer should be 526338 and the test gives     526338.0000000000

-Mat

Nice! Thanks for trying this out.
It works, although on my side, if I run it several times, the default option “!$omp target teams distribute parallel do collapse(2) reduction(+:c) private(kk,k) firstprivate(n)” still gives the same error. It’s totally random: sometimes it works and other times it gives the out-of-range error. Is it just me, or is that still buggy?
Thanks!

Odd. I ran the code 100 times in a row and saw no issue. Only when I lowered the optimization level to -O0 or -O1 did I start to see the non-deterministic behavior. I tried adding “i” and “j” to the private clause, and “mm” to firstprivate, but it didn’t change the behavior. Though since parallel loop index variables are private by default and “mm” is just a loop bound that can be shared, I wouldn’t have expected it to.

Can you try adding “-O2” to your compile line to see if the behavior goes away for you as well?
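
Based on your original compile line, that would be something like:

% nvfortran -pg -mp=gpu -gpu=cc80,managed -Minfo=mp -O2 test.f90 -o test_omp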

Our support for array reductions is relatively new, though, so it could be a compiler issue. I’ll ask one of our engineers to take a look and double-check that I’m not missing anything.

In general, we recommend using “loop” since it’s more suited to GPU programming and allows the compiler to perform better optimization. You can use it as a workaround here, but it may be the better solution anyway. The only caveat to “loop” is that it’s not yet as well supported by other compilers.

I talked with the engineer who implemented array reductions. I thought it was a shared memory issue, but it’s actually a heap overflow. Here’s his explanation:

The array in the reduction clause is an allocatable array. Allocatable arrays cannot go into shared/local memory unless we know the size. When doing reductions on arrays, we need to privatize, which means that each block and thread gets a private copy. Since we do not know the size, we dynamically allocate on the device.

The CUDA heap limit is 8MB. In the presence of optimizations like -O2, we generate more optimal code and don’t hit the limit. At -O0, we do hit the limit.

The workaround is to make the array non-allocatable, or to use the CUDA API to increase the heap limit.
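
A rough, untested sketch of that CUDA API call from Fortran (assumes compiling with -cuda so the “cudafor” module is available; the subroutine name is just for illustration):

subroutine raise_cuda_heap_limit
   use cudafor
   implicit none
   integer :: istat
   integer(kind=cuda_count_kind) :: heapsize
   heapsize = 32_8 * 1024 * 1024          ! 32 MB, same as the env-var setting below
   ! must run before the first offload region launches a kernel
   istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize)
   if (istat /= cudaSuccess) print *, 'cudaDeviceSetLimit failed with status ', istat
end subroutine raise_cuda_heap_limit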

Alternatively, to increase the heap size without touching the code, you can use the environment variable “NV_ACC_CUDA_HEAPSIZE”. I set it to 32MB and it worked, albeit very slowly.

% nvfortran -mp=gpu test.F90 -gpu=cc80,managed -O0 ; a.out
Fatal error: expression 'HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))' (value 1) is not equal to expression 'HX_SUCCES
Abort
% setenv NV_ACC_CUDA_HEAPSIZE 32MB
% a.out
 Answer should be 526338 and the test gives     526338.0000000000
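
And for reference, a sketch of the other workaround (making “c” non-allocatable), assuming the size of 10 can be fixed at compile time; “c” then also needs to be dropped from the allocate/deallocate statements:

module var_rtable
   real*8, dimension(:), allocatable :: a, b
   real*8 :: c(10)    ! fixed-size instead of allocatable, so the private copies
                      ! made for the reduction no longer need the device heap
   integer :: n
end module var_rtable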

Hello!
Yes, I confirm that with -O2 everything works fine; with -O0 or -O1 the problem is still there.
Thank you for the info about shared memory and the heap overflow. I guess I will have to test the different approaches you have suggested (“loop” and increasing the heap size) to see which one is faster for my application.
Thanks again!

Hi Mat,
My apologies for coming back with a new test (similar to a project that I have), but I have been doing this to identify the correct way of doing reductions that gives the right answer.
With this example (attached), only the test with “!$omp target teams distribute” and “NV_ACC_CUDA_HEAPSIZE=32MB” works; the others do not.
It is a slightly different example and maybe I’m doing something wrong, but any help will be appreciated.
Thanks,
Oscar
test.f90 (2.1 KB)