OpenMP kernel is too big?

Hello,

I have a big OpenMP loop that I would like to offload to the GPU. I have tried several options: unified memory versus explicit omp target map clauses, and TARGET TEAMS DISTRIBUTE PARALLEL DO versus TARGET LOOP.
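Schematically, the two directive variants I tried look something like this (a minimal sketch with made-up names, not my actual code):

```fortran
subroutine offload_sketch(n, a)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: a(n)
  real(8) :: work(1000)   ! stand-in for my large per-iteration arrays
  integer :: i

  ! Variant 1: explicit distribution over teams and threads
  !$omp target teams distribute parallel do map(tofrom: a) private(work)
  do i = 1, n
     work(1) = a(i)
     a(i) = work(1) + 1.0d0
  end do

  ! Variant 2: TARGET LOOP, letting the compiler choose the schedule
  !$omp target teams loop map(tofrom: a) private(work)
  do i = 1, n
     work(1) = a(i)
     a(i) = work(1) + 1.0d0
  end do
end subroutine offload_sketch
```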

Whichever combination I pick, I constantly run into memory issues of one kind or another.

With unified shared memory and TARGET LOOP, I ran with NV_ACC_DEBUG=1 on my laptop, and it ran out of memory:

Accelerator Fatal Error: call to cuMemAlloc returned error 2: Out of memory

Okay, that makes sense: my loop uses a lot of large arrays.

So I switched to TARGET TEAMS DISTRIBUTE PARALLEL DO, and I got the following error at compile time:

NVFORTRAN-S-1101-The maximum stack size for a GPU kernel or procedure is limited to 524288 bytes: 4618020 

What I find surprising is that I get this error message not only on my laptop (MX550) but also on a server equipped with an A100.

I have read multiple forum posts explaining how the maximum stack size is computed (e.g., the thread "What is the maximum CUDA Stack frame size per Kerenl."), but I am surprised that the limit is the same on my laptop's GPU and on an A100.

Is there any recommendation to avoid this problem, besides breaking my loop into multiple smaller ones that handle smaller data structures?

I should mention that my code is in Fortran, in case that matters.

Hi Camille,

Accelerator Fatal Error: call to cuMemAlloc returned error 2: Out of memory

Unified memory is able to oversubscribe the GPU memory, so while it's not performant (the data gets paged back and forth between the CPU and GPU), UM shouldn't give an out-of-memory error. However, UM only works with allocated data, so this could be coming from a fixed-size array or from large private arrays.

Though if I understand correctly, you only see this with “LOOP”? In that case, my best guess is that it's due to the private arrays. LOOP is likely using more total threads, and since each thread gets its own copy of every private array, total memory use goes up. For instance, a 100,000-element real(8) private array replicated across 10,000 threads already amounts to 8 GB.

What I find surprising is that I get this error message not only on my laptop (MX550) but also on a server equipped with an A100.

The stack size is a software limit and is independent of the hardware.

Is there any recommendation to avoid this problem, besides breaking my loop into multiple smaller ones that handle smaller data structures?

Without a reproducing example, it's difficult to give specific recommendations. That said, stack usage typically comes from subroutine calls: many calls, or a few that declare local fixed-size arrays, will drive it up.

You can increase the stack size via a CUDA Fortran call to cudaDeviceSetLimit, or by setting the environment variable NV_ACC_CUDA_STACKSIZE.
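For example, a minimal sketch (assuming you can compile with nvfortran and use the cudafor module; the 8 MB value is just an illustration):

```fortran
program raise_gpu_stack
  use cudafor
  implicit none
  integer :: istat
  integer(kind=cuda_count_kind) :: stacksize

  ! Raise the per-thread device stack to 8 MB before any kernel launch
  istat = cudaDeviceSetLimit(cudaLimitStackSize, 8_cuda_count_kind * 1024 * 1024)
  if (istat /= cudaSuccess) print *, 'cudaDeviceSetLimit failed:', istat

  ! Read the limit back to confirm it took effect
  istat = cudaDeviceGetLimit(stacksize, cudaLimitStackSize)
  print *, 'Device stack size (bytes):', stacksize
end program raise_gpu_stack
```

The environment variable achieves the same thing without a code change, e.g. NV_ACC_CUDA_STACKSIZE=8388608 (a byte count).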

However, there is still a hard limit, and the exact value of that limit varies. Given you're using about 8x the default stack size, there's a good possibility that you'll hit it.

If my guess is correct and the issue is caused by local fixed-size arrays in subroutines, the workaround would be to hoist those declarations out of the subroutines and into the host routine where the target directive is located. Then add the array to a private clause and pass it into the subroutine, as in the sketch below.
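A minimal sketch of that hoisting (routine and array names are made up for illustration):

```fortran
! Before: 'work' was a local fixed-size array inside compute(), so every
! thread's copy was carved out of the limited kernel stack.
! After: 'work' is declared in the host routine, listed in private(),
! and passed down as a dummy argument instead.

subroutine host_driver(n, a)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: a(n)
  real(8) :: work(4096)      ! hoisted out of compute()
  integer :: i

  !$omp target teams distribute parallel do private(work) map(tofrom: a)
  do i = 1, n
     call compute(a(i), work)   ! each thread passes its own private copy
  end do
end subroutine host_driver

subroutine compute(x, work)
  !$omp declare target
  implicit none
  real(8), intent(inout) :: x
  real(8), intent(inout) :: work(4096)   ! dummy argument, no local storage
  work(1) = 2.0d0 * x
  x = work(1)
end subroutine compute
```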

If my guess is not correct, I'll need a reproducing example to better determine the issue and how best to solve it.

-Mat

Hi Mat,

Thank you very much for your answer! I am preparing a reproducer, since I cannot share portions of the real code. Your intuition is correct: I am using a lot of big private arrays and a lot of subroutine calls (though not many distinct ones: I only have three).

My arrays are not declared in the subroutines, though. For the moment, I am either counting on unified shared memory or using map(alloc). By the way, is it considered good practice to mix unified shared memory with explicit map clauses?

Thank you,
Camille

It's fine, and it's better for portability since UM is a compiler feature, not part of OpenMP. For allocated arrays/pointers, the map directives will effectively be ignored given the data is already allocated (it won't be allocated twice), so you've added a bit more work for yourself, but it's not wasted effort if/when you decide to stop using UM.
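For instance, both modes coexist in something like this minimal sketch (assuming a recent nvfortran; the -gpu=mem:managed flag is one way to enable UM):

```fortran
program um_plus_map
  implicit none
  real(8), allocatable :: a(:)
  integer :: i, n

  n = 1000000
  allocate(a(n))
  a = 1.0d0

  ! Compiled with UM (e.g. nvfortran -mp=gpu -gpu=mem:managed), this map
  ! clause is effectively a no-op for the allocatable array, but it keeps
  ! the code correct if UM is later disabled.
  !$omp target teams distribute parallel do map(tofrom: a)
  do i = 1, n
     a(i) = 2.0d0 * a(i)
  end do

  print *, 'a(1) =', a(1)
  deallocate(a)
end program um_plus_map
```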