Out of Memory

Hi Mat,

I set up a second machine with the exact same configuration and the same GPU (Tesla P4). However, I am running into a memory issue, as seen in the error message below. The first machine executes the code without any issues.
Here is the test code:

module vecaddmod
  implicit none
 contains
  subroutine vecaddgpu( r, a, b, n )
   real, dimension(:) :: r, a, b
   integer :: n
   integer :: i
! offload the loop: copy a and b to the device, copy r back to the host
!$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
   do i = 1, n
    r(i) = a(i) + b(i)
   enddo
  end subroutine
end module

program main
  use vecaddmod
  implicit none
  integer :: n, i, errs, argcount
  real, dimension(:), allocatable :: a, b, r, e
  character(len=10) :: arg1
  argcount = command_argument_count()
  n = 1000000000  ! default value
  if( argcount >= 1 )then
   call get_command_argument( 1, arg1 )
    read( arg1, * ) n
   if( n <= 0 ) n = 100000
  endif
  allocate( a(n), b(n), r(n), e(n) )
  do i = 1, n
   a(i) = i
   b(i) = 1000*i
  enddo
  ! compute on the GPU
  call vecaddgpu( r, a, b, n )
  ! compute on the host to compare
  do i = 1, n
   e(i) = a(i) + b(i)
  enddo
  ! compare results
  errs = 0
  do i = 1, n
   if( r(i) /= e(i) )then
     errs = errs + 1
   endif
  enddo
  print *, errs, ' errors found'
  if( errs /= 0 ) call exit(errs)
end program

I compile the above code with

nvfortran -acc=gpu -fast -gpu=cc61,cuda12.2,managed -Minfo=accel -stdpar=gpu f1.F90

Thereafter, execution of the code yields the following error:

Out of memory allocating 4000000000 bytes of device memory
Failing in Thread:1
total/free CUDA memory: 7975862272/3861118976
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 6.1, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x7fa5bf74d020 device:0x7fa1d2000000 size:4000000000 presentcount:1+0 line:8 name:a(:n)
allocated block device:0x7fa1d2000000 size:4000000000 thread:1
Accelerator Fatal Error: call to cuMemAlloc returned error 2: Out of memory
 File: f1.F90
 Function: vecaddgpu:4
 Line: 8

I am at a loss as to what is going on here. Any ideas? I wonder if unified memory is not working, since the same code works fine for a smaller value of n.

Cheers,
Jyoti

Hi Jyoti,

A P4 has 8 GB of memory. Your program has four arrays of about 4 GB (3.7 GiB) each, so a P4 can’t fit them in memory. However, CUDA Unified Memory supports oversubscription, where the UM pool can be larger than the available device memory. Performance can suffer if data has to be paged back and forth, but at least it should work.
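
To put numbers on it, here’s a standalone back-of-the-envelope sketch (not part of your test code):

program footprint
  ! estimate the memory the test needs: n reals of 4 bytes per array
  implicit none
  integer(8), parameter :: n = 1000000000_8   ! default n in f1.F90
  integer(8) :: per_array
  per_array = 4_8 * n                         ! 4e9 bytes, ~3.7 GiB per array
  print *, 'per array (GB):        ', real(per_array) / 1e9
  print *, 'a,b,r on device (GB):  ', real(3_8 * per_array) / 1e9  ! explicit copies: already > 8 GB on a P4
  print *, 'all four under UM (GB):', real(4_8 * per_array) / 1e9  ! managed pool holds every allocatable
end program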

From your output, it appears the arrays are being explicitly allocated rather than placed in UM. “a” is in the present table, but it shouldn’t appear there if UM were enabled. The runtime then fails when allocating the next array, likely “b”.

The question is why UM is disabled, which unfortunately I don’t know.

You have both “-gpu=managed” and “-stdpar=gpu” on the command line, both of which enable UM. Still, maybe double-check that you are indeed compiling with these flags?

What’s the CUDA Driver version? Maybe there’s a mismatch and you should remove the “cuda12.2” option?
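
If it’s easier to check from Fortran itself, here’s a minimal sketch, assuming the cudafor module that ships with the NVHPC compilers (compile with “nvfortran -cuda”); “nvidia-smi” also prints the driver/CUDA pairing at the top of its output:

program drvquery
  ! query the installed CUDA driver and runtime versions
  use cudafor
  implicit none
  integer :: drv, rtv, istat
  istat = cudaDriverGetVersion( drv )   ! e.g. 12020 corresponds to CUDA 12.2
  istat = cudaRuntimeGetVersion( rtv )
  print *, 'driver supports CUDA: ', drv
  print *, 'runtime CUDA version: ', rtv
end program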

Sorry that I can’t be more helpful,
Mat

Hi Mat,

This statement prompted me to try something different.

You have both “-gpu=managed” and “-stdpar=gpu” on the command line, both of which enable UM.

I recompiled without the “-stdpar=gpu” flag, and the code worked without a hitch! I assume UM was getting disabled when I used both flags. Strange.
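
For reference, the compile line that now works is the original minus that one flag:

nvfortran -acc=gpu -fast -gpu=cc61,cuda12.2,managed -Minfo=accel f1.F90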

I will recompile and run my codes with the same change and see if that works as well. Will keep you updated.

Driver Version: 535.104.12 CUDA Version: 12.2

Cheers,
Jyoti

Odd. I’ve used both flags together numerous times, so this is unexpected; something is clearly going on with the stdpar flag. It does seem system-specific, but what?

One thing to try is changing the code to:

  subroutine vecaddgpu( r, a, b, n )
   real, dimension(:) :: r, a, b
   integer :: n
   integer :: i
! original directive, disabled: !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
!$acc kernels loop present(r,a,b)
   do i = 1, n
    r(i) = a(i) + b(i)
   enddo
  end subroutine

The “present” clause should cause a runtime error if UM isn’t actually being used. Not sure it will get us closer to the root cause, but it might confirm what’s going on.

Also, what compiler version are you using?

I’ve tried, without luck, to reproduce the error with every release since 22.5, but I’m using a P100 with an older CUDA driver.

Hi Mat,

Looks like UM is not being used. Here is the error I get:

hostptr=0x7f0009f4d020,stride=1,size=1000000000,eltsize=4,name=a(:n),flags=0x200=present,async=-1,threadid=1
FATAL ERROR: data in PRESENT clause was not found on device 1: name=a(:n) host:0x7f0009f4d020
 file:f1.F90 vecaddgpu line:8

Compiler: nvfortran 23.9-0 64-bit target on x86-64 Linux -tp ivybridge

Cheers,
Jyoti

Ok, it’s definitely disabling UM when both flags are enabled, but since I can’t reproduce it, I unfortunately have no idea why. Plus, since it works on the other system with, as you said, the exact same configuration, it’s a mystery.

Given it works as expected when you take off “-stdpar” and you’re not using DO CONCURRENT, let’s move forward with that.
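
For the record, “-stdpar=gpu” is only needed to offload standard Fortran parallelism, i.e. DO CONCURRENT. A hypothetical sketch of the same vector add written that way (not your code, just what the flag targets):

  subroutine vecadd_stdpar( r, a, b, n )
   ! standard-parallel version: with -stdpar=gpu the compiler can offload
   ! this loop with no OpenACC directives at all
   real, dimension(:) :: r, a, b
   integer :: n
   integer :: i
   do concurrent (i = 1:n)
    r(i) = a(i) + b(i)
   enddo
  end subroutine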
