Out of Memory

Hi Mat,

I set up a second machine with the exact same configuration and the same GPU (Tesla P4). However, I am running into a memory issue, as seen in the error message below. The first machine executes the code without any issues.
Here is the test code:

module vecaddmod
  implicit none
 contains
  subroutine vecaddgpu( r, a, b, n )
   real, dimension(:) :: r, a, b
   integer :: n
   integer :: i
! offload the loop: copy a and b to the device, copy r back to the host
!$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
   do i = 1, n
    r(i) = a(i) + b(i)
   enddo
  end subroutine
end module

program main
  use vecaddmod
  implicit none
  integer :: n, i, errs, argcount
  real, dimension(:), allocatable :: a, b, r, e
  character(len=10) :: arg1
  argcount = command_argument_count()
  n = 1000000000  ! default value
  if( argcount >= 1 )then
   call get_command_argument( 1, arg1 )
    read( arg1, * ) n
   if( n <= 0 ) n = 100000
  endif
  allocate( a(n), b(n), r(n), e(n) )
  do i = 1, n
   a(i) = i
   b(i) = 1000*i
  enddo
  ! compute on the GPU
  call vecaddgpu( r, a, b, n )
  ! compute on the host to compare
  do i = 1, n
   e(i) = a(i) + b(i)
  enddo
  ! compare results
  errs = 0
  do i = 1, n
   if( r(i) /= e(i) )then
     errs = errs + 1
   endif
  enddo
  print *, errs, ' errors found'
  if( errs /= 0 ) call exit(errs)
end program

I compile the above code with

nvfortran -acc=gpu -fast -gpu=cc61,cuda12.2,managed -Minfo=accel -stdpar=gpu f1.F90

Thereafter, execution of the code yields the following error:

Out of memory allocating 4000000000 bytes of device memory
Failing in Thread:1
total/free CUDA memory: 7975862272/3861118976
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 6.1, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x7fa5bf74d020 device:0x7fa1d2000000 size:4000000000 presentcount:1+0 line:8 name:a(:n)
allocated block device:0x7fa1d2000000 size:4000000000 thread:1
Accelerator Fatal Error: call to cuMemAlloc returned error 2: Out of memory
 File: f1.F90
 Function: vecaddgpu:4
 Line: 8

I am at a loss as to what is going on here. Any ideas? I wonder if unified memory is not working, since the same code works fine for a smaller value of n.

Cheers,
Jyoti

Hi Jyoti,

A P4 has 8 GB of memory. Your program has four arrays of about 4 GB (3.7 GiB) each, so a P4 can’t fit them in memory. However, CUDA Unified Memory supports oversubscription, where the UM pool can be larger than the available device memory. Performance can suffer if data has to be paged back and forth, but at least it should work.
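
To put numbers on it, here’s a standalone back-of-the-envelope sketch (not part of your test code):

program footprint
  ! estimate the memory the test needs: n reals of 4 bytes per array
  implicit none
  integer(8), parameter :: n = 1000000000_8   ! default n in f1.F90
  integer(8) :: per_array
  per_array = 4_8 * n                         ! 4e9 bytes, ~3.7 GiB per array
  print *, 'per array (GB):        ', real(per_array) / 1e9
  print *, 'a,b,r on device (GB):  ', real(3_8 * per_array) / 1e9  ! explicit copies: already > 8 GB on a P4
  print *, 'all four under UM (GB):', real(4_8 * per_array) / 1e9  ! managed pool holds every allocatable
end program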

From your output, it appears the arrays are being explicitly allocated rather than placed in UM. “a” is in the present table, but it shouldn’t appear there if UM were enabled. The runtime then fails when allocating the next array, likely “b”.

The question is why UM is disabled, which unfortunately I don’t know.

You have both “-gpu=managed” and “-stdpar=gpu” on the command line, both of which enable UM. Still, maybe double-check that you are indeed compiling with these flags?

What’s the CUDA Driver version? Maybe there’s a mismatch and you should remove the “cuda12.2” option?
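
If it’s easier to check from Fortran itself, here’s a minimal sketch, assuming the cudafor module that ships with the NVHPC compilers (compile with “nvfortran -cuda”); “nvidia-smi” also prints the driver/CUDA pairing at the top of its output:

program drvquery
  ! query the installed CUDA driver and runtime versions
  use cudafor
  implicit none
  integer :: drv, rtv, istat
  istat = cudaDriverGetVersion( drv )   ! e.g. 12020 corresponds to CUDA 12.2
  istat = cudaRuntimeGetVersion( rtv )
  print *, 'driver supports CUDA: ', drv
  print *, 'runtime CUDA version: ', rtv
end program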

Sorry that I can’t be more helpful,
Mat

Hi Mat,

This statement prompted me to try something different.

You have both “-gpu=managed” and “-stdpar=gpu” on the command line, both of which enable UM.

I recompiled without the “-stdpar=gpu” flag, and the code worked without a hitch! I assume UM was getting disabled when I used both flags. Strange.
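
For reference, the compile line that now works is the original minus that one flag:

nvfortran -acc=gpu -fast -gpu=cc61,cuda12.2,managed -Minfo=accel f1.F90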

I will recompile and run my codes with the same change and see if that works as well. Will keep you updated.

Driver Version: 535.104.12 CUDA Version: 12.2

Cheers,
Jyoti

Odd. I’ve used both flags together numerous times, so this is unexpected; something is clearly going on with the stdpar flag. It does seem system-specific, but what?

One thing to try is changing the code to:

  subroutine vecaddgpu( r, a, b, n )
   real, dimension(:) :: r, a, b
   integer :: n
   integer :: i
! original directive, disabled: !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
!$acc kernels loop present(r,a,b)
   do i = 1, n
    r(i) = a(i) + b(i)
   enddo
  end subroutine

The “present” clause should cause a runtime error if UM isn’t actually being used. Not sure it will get us closer to the root cause, but it might confirm what’s going on.

Also, what compiler version are you using?

I’ve tried, without luck, to reproduce the error with every release since 22.5, but I’m using a P100 with an older CUDA driver.

Hi Mat,

Looks like UM is not being used. Here is the error I get:

hostptr=0x7f0009f4d020,stride=1,size=1000000000,eltsize=4,name=a(:n),flags=0x200=present,async=-1,threadid=1
FATAL ERROR: data in PRESENT clause was not found on device 1: name=a(:n) host:0x7f0009f4d020
 file:f1.F90 vecaddgpu line:8

Compiler: nvfortran 23.9-0 64-bit target on x86-64 Linux -tp ivybridge

Cheers,
Jyoti

Ok, it’s definitely disabling UM when both flags are enabled, but since I can’t reproduce it, I unfortunately have no idea why. Plus, since it works on the other system with, as you said, the exact same configuration, it’s a mystery.

Given it works as expected when you take off “-stdpar” and you’re not using DO CONCURRENT, let’s move forward with that.
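
For the record, “-stdpar=gpu” is only needed to offload standard Fortran parallelism, i.e. DO CONCURRENT. A hypothetical sketch of the same vector add written that way (not your code, just what the flag targets):

  subroutine vecadd_stdpar( r, a, b, n )
   ! standard-parallel version: with -stdpar=gpu the compiler can offload
   ! this loop with no OpenACC directives at all
   real, dimension(:) :: r, a, b
   integer :: n
   integer :: i
   do concurrent (i = 1:n)
    r(i) = a(i) + b(i)
   enddo
  end subroutine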
