OpenMP kernel is too big?

Hello,

I have a big OpenMP loop that I would like to offload to the GPU. I have tried several options: unified memory versus explicit omp target map clauses, and TARGET TEAMS DISTRIBUTE PARALLEL DO versus TARGET LOOP.
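Schematically, the two directive variants I tried look something like this (a minimal sketch with made-up names, not my actual code):

```fortran
subroutine offload_sketch(n, a)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: a(n)
  real(8) :: work(1000)   ! stand-in for my large per-iteration arrays
  integer :: i

  ! Variant 1: explicit distribution over teams and threads
  !$omp target teams distribute parallel do map(tofrom: a) private(work)
  do i = 1, n
     work(1) = a(i)
     a(i) = work(1) + 1.0d0
  end do

  ! Variant 2: TARGET LOOP, letting the compiler choose the schedule
  !$omp target teams loop map(tofrom: a) private(work)
  do i = 1, n
     work(1) = a(i)
     a(i) = work(1) + 1.0d0
  end do
end subroutine offload_sketch
```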

Whichever combination I pick, I constantly run into memory issues of one kind or another.

With unified shared memory and TARGET LOOP, I ran with NV_ACC_DEBUG=1 on my laptop, and it ran out of memory:

Accelerator Fatal Error: call to cuMemAlloc returned error 2: Out of memory

Okay, that makes sense: my loop uses a lot of large arrays.

So I switched to TARGET TEAMS DISTRIBUTE PARALLEL DO, and I got the following error at compile time:

NVFORTRAN-S-1101-The maximum stack size for a GPU kernel or procedure is limited to 524288 bytes: 4618020 

What I find surprising is that I get this error message not only on my laptop (MX550) but also on a server equipped with an A100.

I have read multiple forum posts explaining how the maximum stack size is computed (e.g., the thread "What is the maximum CUDA Stack frame size per Kerenl."), but I am surprised that the limit is the same on my laptop's GPU and on an A100.

Is there any recommendation to avoid this problem, besides breaking my loop into multiple smaller ones that handle smaller data structures?

I should mention that my code is in Fortran, in case that matters.

Hi Camille,

Accelerator Fatal Error: call to cuMemAlloc returned error 2: Out of memory

Unified memory is able to oversubscribe the GPU memory, so while it's not performant (the data gets paged back and forth between the CPU and GPU), UM shouldn't give an out-of-memory error. However, UM only works with allocated data, so this could be coming from a fixed-size array or from large private arrays.

Though if I understand correctly, you only see this with “LOOP”? In that case, my best guess is that it's due to the private arrays. LOOP is likely using more total threads, and since each thread gets its own copy of every private array, total memory use goes up. For instance, a 100,000-element real(8) private array replicated across 10,000 threads already amounts to 8 GB.

What I find surprising is that I get this error message not only on my laptop (MX550) but also on a server equipped with an A100.

The stack size is a software limit and is independent of the hardware.

Is there any recommendation to avoid this problem, besides breaking my loop into multiple smaller ones that handle smaller data structures?

Without a reproducing example, it's difficult to give specific recommendations. That said, stack usage typically comes from subroutine calls: many calls, or a few that declare local fixed-size arrays, will drive it up.

You can increase the stack size via a CUDA Fortran call to cudaDeviceSetLimit, or by setting the environment variable NV_ACC_CUDA_STACKSIZE.
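For example, a minimal sketch (assuming you can compile with nvfortran and use the cudafor module; the 8 MB value is just an illustration):

```fortran
program raise_gpu_stack
  use cudafor
  implicit none
  integer :: istat
  integer(kind=cuda_count_kind) :: stacksize

  ! Raise the per-thread device stack to 8 MB before any kernel launch
  istat = cudaDeviceSetLimit(cudaLimitStackSize, 8_cuda_count_kind * 1024 * 1024)
  if (istat /= cudaSuccess) print *, 'cudaDeviceSetLimit failed:', istat

  ! Read the limit back to confirm it took effect
  istat = cudaDeviceGetLimit(stacksize, cudaLimitStackSize)
  print *, 'Device stack size (bytes):', stacksize
end program raise_gpu_stack
```

The environment variable achieves the same thing without a code change, e.g. NV_ACC_CUDA_STACKSIZE=8388608 (a byte count).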

However, there is still a hard limit, and the exact value of that limit varies. Given you're using about 8x the default stack size, there's a good possibility that you'll hit it.

If my guess is correct and the issue is caused by local fixed-size arrays in subroutines, the workaround would be to hoist those declarations out of the subroutines and into the host routine where the target directive is located. Then add the array to a private clause and pass it into the subroutine, as in the sketch below.
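A minimal sketch of that hoisting (routine and array names are made up for illustration):

```fortran
! Before: 'work' was a local fixed-size array inside compute(), so every
! thread's copy was carved out of the limited kernel stack.
! After: 'work' is declared in the host routine, listed in private(),
! and passed down as a dummy argument instead.

subroutine host_driver(n, a)
  implicit none
  integer, intent(in) :: n
  real(8), intent(inout) :: a(n)
  real(8) :: work(4096)      ! hoisted out of compute()
  integer :: i

  !$omp target teams distribute parallel do private(work) map(tofrom: a)
  do i = 1, n
     call compute(a(i), work)   ! each thread passes its own private copy
  end do
end subroutine host_driver

subroutine compute(x, work)
  !$omp declare target
  implicit none
  real(8), intent(inout) :: x
  real(8), intent(inout) :: work(4096)   ! dummy argument, no local storage
  work(1) = 2.0d0 * x
  x = work(1)
end subroutine compute
```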

If my guess is not correct, I'll need a reproducing example to better determine the issue and how best to solve it.

-Mat

Hi Mat,

Thank you very much for your answer! I am preparing a reproducer, since I cannot share portions of the real code. Your intuition is correct: I am using a lot of big private arrays and a lot of subroutine calls (though not many distinct ones: I only have three).

My arrays are not declared in the subroutines, though. For the moment, I am either counting on unified shared memory or using map(alloc). By the way, is it considered good practice to mix unified shared memory with explicit map clauses?

Thank you,
Camille

It's fine, and it's better for portability since UM is a compiler feature, not part of OpenMP. For allocated arrays/pointers, the map directives will effectively be ignored given the data is already allocated (it won't be allocated twice), so you've added a bit more work for yourself, but it's not wasted effort if/when you decide to stop using UM.
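For instance, both modes coexist in something like this minimal sketch (assuming a recent nvfortran; the -gpu=mem:managed flag is one way to enable UM):

```fortran
program um_plus_map
  implicit none
  real(8), allocatable :: a(:)
  integer :: i, n

  n = 1000000
  allocate(a(n))
  a = 1.0d0

  ! Compiled with UM (e.g. nvfortran -mp=gpu -gpu=mem:managed), this map
  ! clause is effectively a no-op for the allocatable array, but it keeps
  ! the code correct if UM is later disabled.
  !$omp target teams distribute parallel do map(tofrom: a)
  do i = 1, n
     a(i) = 2.0d0 * a(i)
  end do

  print *, 'a(1) =', a(1)
  deallocate(a)
end program um_plus_map
```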