cuMemAllocManaged returns out of memory with -stdpar=gpu

Hi,

I have the following simple Fortran code:

program myProgram
  use cudafor
  use constants
  implicit none
  integer :: st
  real, dimension(:,:,:)  , allocatable :: x, y, z
  allocate (x(NX, NY, NZ), stat=st); if ( st /= 0 ) stop " Unable to allocate x(:,:,:)"
  allocate (y(NX, NY, NZ), stat=st); if ( st /= 0 ) stop " Unable to allocate y(:,:,:)"
  allocate (z(NX, NY, NZ), stat=st); if ( st /= 0 ) stop " Unable to allocate z(:,:,:)"
  call grid(x, y, z)
  deallocate(x, y, z)
end program

The variables NX, NY, and NZ are defined in the following module:

module constants  
  implicit none 
  integer, parameter :: NX = 1024
  integer, parameter :: NY = 1024
  integer, parameter :: NZ = 1024
end module constants

The subroutine “grid” is as follows (dx is a grid-spacing constant defined elsewhere, not shown here):

subroutine grid(x, y, z)          
  use cudafor
  use constants
  implicit none              
  real, intent(out), dimension(NX, NY, NZ) :: x, y, z
  integer :: ix, iy, iz
  do concurrent (ix=1:NX, iy=1:NY, iz=1:NZ)
    x(ix, iy, iz) = (ix-1)*dx
    y(ix, iy, iz) = (iy-1)*dx
    z(ix, iy, iz) = (iz-1)*dx
  end do  
end subroutine

If I compile the code with nvfortran using the flags “-O3 -cuda -stdpar=gpu”, I get the following error at runtime:
“__man_alloc04: call to cuMemAllocManaged returned error 2: Out of memory
Aborted”

If instead I compile with “-O3 -cuda -stdpar=multicore”, everything works fine.

If I lower the dimensions NX, NY, NZ (for example to NX=NY=NZ=32), the program also works with the -stdpar=gpu flag.

I’m running Ubuntu 22.04 under the Windows Subsystem for Linux (WSL2) on Windows 10. My graphics card is an NVIDIA Quadro T2000 and the system has 64 GB of RAM.

Can anyone help me understand whether there is something wrong in my code or in the way I compile it? Or is there a compatibility problem with my OS/hardware configuration?

Thank you in advance.

You’re running out of GPU memory. Each of those arrays (x, y, z) is 1024 × 1024 × 1024 × 4 bytes = 4 GiB, so the three together need about 12 GiB, while the Quadro T2000 has 4 GB of memory. It works when you make the arrays small enough to fit in device memory, and it works in the multicore case because that uses the CPU rather than the GPU (and you have enough CPU memory to hold those arrays). In addition, managed memory under WSL2 does not support oversubscription of GPU memory.
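
To see the numbers for yourself, here is a small standalone sketch (added for illustration, not code from the original exchange; the program name is made up) that compares what the three arrays need against what the device reports via cudaMemGetInfo from the cudafor module:

program check_gpu_mem
  use cudafor
  implicit none
  integer, parameter :: NX = 1024, NY = 1024, NZ = 1024
  integer(kind=cuda_count_kind) :: free_bytes, total_bytes
  integer(kind=8) :: needed_bytes
  integer :: istat

  ! three default-real (4-byte) arrays of NX*NY*NZ elements: 3 * 1024**3 * 4 bytes = 12 GiB
  needed_bytes = 3_8 * int(NX, 8) * int(NY, 8) * int(NZ, 8) * 4_8
  istat = cudaMemGetInfo(free_bytes, total_bytes)
  print '(a, f6.1, a)', 'required by x, y, z : ', real(needed_bytes) / 2.0**30, ' GiB'
  print '(a, f6.1, a)', 'free on the device  : ', real(free_bytes)   / 2.0**30, ' GiB'
  print '(a, f6.1, a)', 'total on the device : ', real(total_bytes)  / 2.0**30, ' GiB'
end program check_gpu_mem

On a 4 GB T2000 you would expect the reported total to be about 4 GiB, well short of the roughly 12 GiB required.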

OK @Robert_Crovella, so if I understand correctly: if I ran the program on a native Linux OS (not in WSL) on the same hardware, would it run correctly?
Thank you.

I’m not certain of that, but I think it may work. Even if it does, performance is likely to be disappointing. Oversubscription (if it works) results in demand-paged movement of the x, y, z arrays as they are being processed, and that methodology is generally slow. It’s questionable whether what you have shown here is even sensible to do on a GPU in any setting, but if you wanted higher performance than the “naive” demand-paged oversubscription case (if it works), the typical approach would be to break your arrays into chunks and use an overlapped copy/compute pipeline: move a chunk to the GPU, process it, and move the results back, overlapping the transfers with computation. A rough sketch of that pattern is at the end of this reply.

That still isn’t likely to be very interesting, performance-wise, for the case you have shown here.
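
For illustration, here is a rough, untested sketch of what that copy/compute pipeline could look like in CUDA Fortran for your grid example, chunking along the z dimension with two streams and double-buffered device arrays. Names such as fill_chunk and NZ_CHUNK, and the dx value, are mine and chosen just for the example; whether pinning 12 GB of host memory behaves well under WSL2 is something you would need to verify.

module grid_chunks
  use cudafor
  implicit none
  integer, parameter :: NX = 1024, NY = 1024, NZ = 1024
  integer, parameter :: NZ_CHUNK = 64      ! z-planes per chunk; 6 device buffers of 256 MiB each fit easily in 4 GB
  real,    parameter :: dx = 1.0           ! placeholder grid spacing (the real value is not shown in this thread)
contains
  ! Fill one z-chunk of the grid on the device. iz0 is the global z index of the chunk's first plane.
  attributes(global) subroutine fill_chunk(xd, yd, zd, iz0)
    real, device   :: xd(NX, NY, NZ_CHUNK), yd(NX, NY, NZ_CHUNK), zd(NX, NY, NZ_CHUNK)
    integer, value :: iz0
    integer :: ix, iy, iz
    ix = threadIdx%x + (blockIdx%x - 1) * blockDim%x
    iy = threadIdx%y + (blockIdx%y - 1) * blockDim%y
    iz = blockIdx%z                        ! one block layer per z-plane of the chunk
    if (ix <= NX .and. iy <= NY .and. iz <= NZ_CHUNK) then
      xd(ix, iy, iz) = (ix - 1) * dx
      yd(ix, iy, iz) = (iy - 1) * dx
      zd(ix, iy, iz) = (iz0 + iz - 2) * dx ! global z index is iz0 + iz - 1
    end if
  end subroutine
end module grid_chunks

program chunked_grid
  use cudafor
  use grid_chunks
  implicit none
  real, allocatable, pinned :: x(:,:,:), y(:,:,:), z(:,:,:)            ! pinned host memory for async copies
  real, allocatable, device :: xd(:,:,:,:), yd(:,:,:,:), zd(:,:,:,:)   ! double-buffered device chunks
  integer(kind=cuda_stream_kind) :: stream(2)
  type(dim3) :: grid_dim, block_dim
  integer :: ic, buf, iz0, istat

  allocate (x(NX, NY, NZ), y(NX, NY, NZ), z(NX, NY, NZ))
  allocate (xd(NX, NY, NZ_CHUNK, 2), yd(NX, NY, NZ_CHUNK, 2), zd(NX, NY, NZ_CHUNK, 2))
  istat = cudaStreamCreate(stream(1))
  istat = cudaStreamCreate(stream(2))

  block_dim = dim3(32, 8, 1)
  grid_dim  = dim3((NX + 31) / 32, (NY + 7) / 8, NZ_CHUNK)

  do ic = 1, NZ / NZ_CHUNK
    buf = mod(ic - 1, 2) + 1               ! alternate buffers/streams so one chunk's copy overlaps the next chunk's compute
    iz0 = (ic - 1) * NZ_CHUNK + 1
    call fill_chunk<<<grid_dim, block_dim, 0, stream(buf)>>>(xd(:,:,:,buf), yd(:,:,:,buf), zd(:,:,:,buf), iz0)
    istat = cudaMemcpyAsync(x(:,:,iz0:iz0+NZ_CHUNK-1), xd(:,:,:,buf), NX*NY*NZ_CHUNK, &
                            cudaMemcpyDeviceToHost, stream(buf))
    istat = cudaMemcpyAsync(y(:,:,iz0:iz0+NZ_CHUNK-1), yd(:,:,:,buf), NX*NY*NZ_CHUNK, &
                            cudaMemcpyDeviceToHost, stream(buf))
    istat = cudaMemcpyAsync(z(:,:,iz0:iz0+NZ_CHUNK-1), zd(:,:,:,buf), NX*NY*NZ_CHUNK, &
                            cudaMemcpyDeviceToHost, stream(buf))
  end do
  istat = cudaDeviceSynchronize()
  print *, 'done, z(NX,NY,NZ) =', z(NX, NY, NZ)   ! expect (NZ-1)*dx
end program chunked_grid

Because operations queued on the same stream execute in order, the copies for one chunk are guaranteed to finish before that chunk's device buffers are refilled two iterations later, while the other stream keeps the GPU busy in between.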

Understood. Thank you for the help.
