I have a big OpenMP loop that I would like to offload to the GPU. I tried several options: unified shared memory, explicit omp target map clauses, and TARGET TEAMS DISTRIBUTE PARALLEL DO, and I constantly run into memory issues.
With unified shared memory and the LOOP directive, I ran with NV_ACC_DEBUG=1 on my laptop, and it ran out of memory:
Accelerator Fatal Error: call to cuMemAlloc returned error 2: Out of memory
Okay, that makes sense, my loop uses a lot of large arrays.
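For reference, here is a minimal sketch of the pattern I am using with unified shared memory (array names and sizes are illustrative, not my real code):

    ! Sketch only: compiled with nvfortran -mp=gpu -gpu=mem:unified
    subroutine compute(n, a, b, c)
      implicit none
      integer, intent(in)  :: n
      real(8), intent(in)  :: a(n), b(n)
      real(8), intent(out) :: c(n)
      integer :: i

      ! No map clauses needed: unified shared memory lets the
      ! runtime migrate a, b, c between host and device
      !$omp target teams loop
      do i = 1, n
         c(i) = a(i) + b(i)
      end do
    end subroutine compute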
So I switched to TARGET TEAMS DISTRIBUTE PARALLEL DO with explicit map clauses, and I got the following error:
NVFORTRAN-S-1101-The maximum stack size for a GPU kernel or procedure is limited to 524288 bytes: 4618020
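My understanding is that the large per-iteration local arrays are what push the kernel's stack frame past the limit. A reduced sketch of the kind of loop body involved (names and sizes are illustrative; a private array of ~600,000 doubles is roughly the 4.6 MB the compiler reports):

    subroutine compute(n, a, c)
      implicit none
      integer, intent(in)  :: n
      real(8), intent(in)  :: a(n)
      real(8), intent(out) :: c(n)
      real(8) :: work(600000)   ! ~4.8 MB per thread, far above 524288 bytes
      integer :: i, j

      ! Each thread gets its own private copy of work on the GPU stack
      !$omp target teams distribute parallel do private(work) map(to:a) map(from:c)
      do i = 1, n
         do j = 1, size(work)
            work(j) = a(i) * j
         end do
         c(i) = work(size(work))
      end do
    end subroutine compute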
What I find surprising is that I get this error message not only on my laptop (MX550) but also on a server equipped with an A100.
I have read several forum posts that explain how the maximum stack size is computed (e.g., "What is the maximum CUDA Stack frame size per Kernel"), but I am surprised that the limit is the same on my laptop's GPU and on an A100.
Is there any recommendation for avoiding this problem, besides breaking my loop into multiple smaller ones that handle smaller data structures?
I should mention that my code is in Fortran, if that matters.