OpenMP offloading to GPU: call to cuMemcpyDtoHAsync returned error 700

! This code is for testing GPU offloading @NERSC
PROGRAM CK
  USE OMP_LIB
  IMPLICIT NONE
  call GPU_test()
CONTAINS
  SUBROUTINE GPU_test()
    implicit none
    REAL(kind=8), allocatable :: x(:)       ! data … cannot be double complex.
    REAL(kind=8) :: startTime               ! for timing
    INTEGER(kind=8) :: maxloop = 100000000  ! the loop count = 100M for testing
    INTEGER(kind=8) :: i, j                 ! loop counters
    INTEGER(kind=8) :: N = 1000             ! array size

    allocate(x(1:N)); x = 0d0

    ! serial …
    x = 0d0; startTime = omp_get_wtime()
    DO i = 1, maxloop
      DO j = 1, N
        x(j) = x(j) + 1D0/maxloop           ! every element has the same value.
      END DO
    END DO
    print *, 'CPU calculation time (sec) = ', sngl(omp_get_wtime()-startTime), sum(x)

    ! threaded ... GPU offloading ...
    x = 0d0; startTime = omp_get_wtime()
    !$OMP target teams distribute parallel do reduction(+:x) map(tofrom:x) private(i,j)
    DO i = 1, maxloop
      DO j = 1, N
        x(j) = x(j) + 1D0/maxloop           ! every element has the same value.
      END DO
    END DO
    !$OMP end target teams distribute parallel do
    print *, 'OMP calculation time (sec) = ', sngl(omp_get_wtime()-startTime), sum(x)

  END SUBROUTINE GPU_test
END PROGRAM CK

I am testing GPU acceleration with this code, which basically does an array reduction over a huge number of loop iterations. The error I get is "Accelerator Fatal Error: call to cuMemcpyDtoHAsync returned error 700: Illegal address during kernel execution", or sometimes a cuStreamSynchronize error instead.

If I put num_teams(2) in the directive:
!$OMP target teams distribute parallel do num_teams(2) reduction(+:x) private(i,j)
the code works, but it is very slow. With any value larger than 4 in num_teams(), I get the same cuStreamSynchronize/cuMemcpyDtoHAsync error.

Could you please advise me on how to fix the error and speed up this simple test using GPU offloading?

Thank you!

Hi Dr_Zee,

My best guess is that you're encountering a heap overflow. Each thread needs to allocate its own private copy of the reduction array, and that allocation is done on the GPU. Hence, if the array is large or there is a significant number of threads, the device heap (which by default is relatively small) can fill up.
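As a rough back-of-the-envelope estimate (mine, not a measured figure): with N = 1000 doubles, each private copy of x is about 8 KB, so if something like 100,000 threads each hold a copy, that is on the order of 800 MB of device heap for the private copies alone.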

Try setting the environment variable “NV_ACC_CUDA_HEAPSIZE” to a larger value like 64MB.
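For example, with a bash shell you would run "export NV_ACC_CUDA_HEAPSIZE=64MB" before launching the executable (use the equivalent setenv for csh, or set it in your batch script).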

Note that I tried your example here and it worked on my H100, but the default heap size can vary by device.

-Mat

Thanks for the reply.
Do you see a speedup when you run the code? Do you have to set num_teams or thread_limit?
I set NV_ACC_CUDA_HEAPSIZE=64MB. The code did run with num_teams(8), but it is still much slower than the single-threaded version. What might be the problem?

Thanks

No, but oddly, while it "just worked" for me last week, I needed to set NV_ACC_CUDA_HEAPSIZE=2GB to get it to work this morning.

"Do you see a speedup when you run the code?"

No, nor would I expect it. Besides the heap issue, having every thread allocate its own private copy of the array on the device can be quite expensive. It would help a bit if "x" were a fixed-size array, but given that it is a relatively large array, there is still a lot of overhead in performing the final reduction.

In general, it’s best to avoid array reductions, and if needed, use small fixed size array.