cuMemAllocManaged returns out of memory with -stdpar=gpu

Hi,

I have the following simple Fortran code:

program myProgram
  use cudafor
  use constants
  implicit none
  integer :: st
  real, dimension(:,:,:)  , allocatable :: x, y, z
  allocate (x(NX, NY, NZ), stat=st); if ( st /= 0 ) stop " Unable to allocate x(:,:,:)"
  allocate (y(NX, NY, NZ), stat=st); if ( st /= 0 ) stop " Unable to allocate y(:,:,:)"
  allocate (z(NX, NY, NZ), stat=st); if ( st /= 0 ) stop " Unable to allocate z(:,:,:)"
  call grid(x, y, z)
  deallocate(x, y, z)
end program

The variables NX, NY, and NZ are defined in the following module:

module constants  
  implicit none 
  integer, parameter :: NX = 1024
  integer, parameter :: NY = 1024
  integer, parameter :: NZ = 1024
end module constants

The subroutine “grid” is as follows (dx is a grid-spacing constant defined elsewhere, not shown here):

subroutine grid(x, y, z)          
  use cudafor
  use constants
  implicit none              
  real, intent(out), dimension(NX, NY, NZ) :: x, y, z
  integer :: ix, iy, iz
  do concurrent (ix=1:NX, iy=1:NY, iz=1:NZ)
    x(ix, iy, iz) = (ix-1)*dx
    y(ix, iy, iz) = (iy-1)*dx
    z(ix, iy, iz) = (iz-1)*dx
  end do  
end subroutine

If I compile the code with nvfortran using the flags “-O3 -cuda -stdpar=gpu”, I get the following error at runtime:
“__man_alloc04: call to cuMemAllocManaged returned error 2: Out of memory
Aborted”

If instead I compile with “-O3 -cuda -stdpar=multicore”, everything works fine.

If I lower the dimensions NX, NY, NZ (for example to NX=NY=NZ=32), the program also works with the -stdpar=gpu flag.

I’m running Ubuntu 22.04 under the Windows Subsystem for Linux (WSL2) on Windows 10. My graphics card is an NVIDIA Quadro T2000 and the system has 64 GB of RAM.

Can anyone help me understand whether there is something wrong in my code or in the way I compile it? Or is there a compatibility problem with my OS/hardware configuration?

Thank you in advance.

You’re running out of GPU memory. Each of those arrays (x, y, z) is 1024 × 1024 × 1024 × 4 bytes = 4 GiB, so the three together need about 12 GiB, while the Quadro T2000 has 4 GB of memory. It works when you make the arrays small enough to fit in device memory, and it works in the multicore case because that uses the CPU rather than the GPU (and you have enough CPU memory to hold those arrays). In addition, managed memory under WSL2 does not support oversubscription of GPU memory.
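
To see the numbers for yourself, here is a small standalone sketch (added for illustration, not code from the original exchange; the program name is made up) that compares what the three arrays need against what the device reports via cudaMemGetInfo from the cudafor module:

program check_gpu_mem
  use cudafor
  implicit none
  integer, parameter :: NX = 1024, NY = 1024, NZ = 1024
  integer(kind=cuda_count_kind) :: free_bytes, total_bytes
  integer(kind=8) :: needed_bytes
  integer :: istat

  ! three default-real (4-byte) arrays of NX*NY*NZ elements: 3 * 1024**3 * 4 bytes = 12 GiB
  needed_bytes = 3_8 * int(NX, 8) * int(NY, 8) * int(NZ, 8) * 4_8
  istat = cudaMemGetInfo(free_bytes, total_bytes)
  print '(a, f6.1, a)', 'required by x, y, z : ', real(needed_bytes) / 2.0**30, ' GiB'
  print '(a, f6.1, a)', 'free on the device  : ', real(free_bytes)   / 2.0**30, ' GiB'
  print '(a, f6.1, a)', 'total on the device : ', real(total_bytes)  / 2.0**30, ' GiB'
end program check_gpu_mem

On a 4 GB T2000 you would expect the reported total to be about 4 GiB, well short of the roughly 12 GiB required.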

OK @Robert_Crovella, so if I understand correctly: if I ran the program on a native Linux OS (not in WSL) on the same hardware, would it run correctly?
Thank you.

I’m not certain of that, but I think it may work. Even if it does, performance is likely to be disappointing. Oversubscription (if it works) results in demand-paged movement of the x, y, z arrays as they are being processed, and that methodology is generally slow. It’s questionable whether what you have shown here is even sensible to do on a GPU in any setting, but if you wanted higher performance than the “naive” demand-paged oversubscription case (if it works), the typical approach would be to break your arrays into chunks and use an overlapped copy/compute pipeline: move a chunk to the GPU, process it, and move the results back, overlapping the transfers with computation. A rough sketch of that pattern is at the end of this reply.

That still isn’t likely to be very interesting, performance-wise, for the case you have shown here.
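
For illustration, here is a rough, untested sketch of what that copy/compute pipeline could look like in CUDA Fortran for your grid example, chunking along the z dimension with two streams and double-buffered device arrays. Names such as fill_chunk and NZ_CHUNK, and the dx value, are mine and chosen just for the example; whether pinning 12 GB of host memory behaves well under WSL2 is something you would need to verify.

module grid_chunks
  use cudafor
  implicit none
  integer, parameter :: NX = 1024, NY = 1024, NZ = 1024
  integer, parameter :: NZ_CHUNK = 64      ! z-planes per chunk; 6 device buffers of 256 MiB each fit easily in 4 GB
  real,    parameter :: dx = 1.0           ! placeholder grid spacing (the real value is not shown in this thread)
contains
  ! Fill one z-chunk of the grid on the device. iz0 is the global z index of the chunk's first plane.
  attributes(global) subroutine fill_chunk(xd, yd, zd, iz0)
    real, device   :: xd(NX, NY, NZ_CHUNK), yd(NX, NY, NZ_CHUNK), zd(NX, NY, NZ_CHUNK)
    integer, value :: iz0
    integer :: ix, iy, iz
    ix = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    iy = threadIdx%y + (blockIdx%y - 1) * blockDim%y
    iz = blockIdx%z                        ! one block layer per z-plane of the chunk
    if (ix <= NX .and. iy <= NY .and. iz <= NZ_CHUNK) then
      xd(ix, iy, iz) = (ix - 1) * dx
      yd(ix, iy, iz) = (iy - 1) * dx
      zd(ix, iy, iz) = (iz0 + iz - 2) * dx ! global z index is iz0 + iz - 1
    end if
  end subroutine
end module grid_chunks

program chunked_grid
  use cudafor
  use grid_chunks
  implicit none
  real, allocatable, pinned :: x(:,:,:), y(:,:,:), z(:,:,:)            ! pinned host memory for async copies
  real, allocatable, device :: xd(:,:,:,:), yd(:,:,:,:), zd(:,:,:,:)   ! double-buffered device chunks
  integer(kind=cuda_stream_kind) :: stream(2)
  type(dim3) :: grid_dim, block_dim
  integer :: ic, buf, iz0, istat

  allocate (x(NX, NY, NZ), y(NX, NY, NZ), z(NX, NY, NZ))
  allocate (xd(NX, NY, NZ_CHUNK, 2), yd(NX, NY, NZ_CHUNK, 2), zd(NX, NY, NZ_CHUNK, 2))
  istat = cudaStreamCreate(stream(1))
  istat = cudaStreamCreate(stream(2))

  block_dim = dim3(32, 8, 1)
  grid_dim  = dim3((NX + 31) / 32, (NY + 7) / 8, NZ_CHUNK)

  do ic = 1, NZ / NZ_CHUNK
    buf = mod(ic - 1, 2) + 1               ! alternate buffers/streams so one chunk's copy overlaps the next chunk's compute
    iz0 = (ic - 1) * NZ_CHUNK + 1
    call fill_chunk<<<grid_dim, block_dim, 0, stream(buf)>>>(xd(:,:,:,buf), yd(:,:,:,buf), zd(:,:,:,buf), iz0)
    istat = cudaMemcpyAsync(x(:,:,iz0:iz0+NZ_CHUNK-1), xd(:,:,:,buf), NX*NY*NZ_CHUNK, &
                            cudaMemcpyDeviceToHost, stream(buf))
    istat = cudaMemcpyAsync(y(:,:,iz0:iz0+NZ_CHUNK-1), yd(:,:,:,buf), NX*NY*NZ_CHUNK, &
                            cudaMemcpyDeviceToHost, stream(buf))
    istat = cudaMemcpyAsync(z(:,:,iz0:iz0+NZ_CHUNK-1), zd(:,:,:,buf), NX*NY*NZ_CHUNK, &
                            cudaMemcpyDeviceToHost, stream(buf))
  end do
  istat = cudaDeviceSynchronize()
  print *, 'done, z(NX,NY,NZ) =', z(NX, NY, NZ)   ! expect (NZ-1)*dx
end program chunked_grid

Because operations queued on the same stream execute in order, the copies for one chunk are guaranteed to finish before that chunk's device buffers are refilled two iterations later, while the other stream keeps the GPU busy in between.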

Understood. Thank you for the help.
