Hello,
I have a big OpenMP loop that I would like to offload to the GPU. I tried several options: unified memory versus explicit omp target map clauses, and TARGET TEAMS DISTRIBUTE PARALLEL DO versus TARGET LOOP.
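For reference, these are roughly the two directive variants I tried (the array names and the simple loop body here are placeholders, not my actual code):

!$omp target teams distribute parallel do map(to: a, b) map(from: c)
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp end target teams distribute parallel do

! ... and the loop variant:
!$omp target teams loop map(to: a, b) map(from: c)
do i = 1, n
   c(i) = a(i) + b(i)
end do
!$omp end target teams loop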
I constantly run into memory issues of one kind or another. With unified shared memory and LOOP, I ran with NV_ACC_DEBUG=1 on my laptop, and it ran out of memory:
Accelerator Fatal Error: call to cuMemAlloc returned error 2: Out of memory
Okay, that makes sense: my loop uses a lot of large arrays. So I switched to TARGET TEAMS DISTRIBUTE PARALLEL DO, and I got the following error:
NVFORTRAN-S-1101-The maximum stack size for a GPU kernel or procedure is limited to 524288 bytes: 4618020
What I find surprising is that I get this error message not only on my laptop (MX550) but also on a server equipped with an A100.
I have read multiple forum posts that explain how the maximum stack size is computed (e.g., here: What is the maximum CUDA Stack frame size per Kerenl), but I am surprised that the limit is the same on my laptop's GPU and on an A100. Is that expected?
Is there any recommendation for avoiding this problem, other than breaking my loop into several smaller ones that handle smaller data structures?
I should mention that my code is in Fortran, in case that matters.
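In case it helps, here is a simplified, made-up sketch of the kind of pattern I suspect triggers the stack-size error (not my actual code; names and sizes are invented). My understanding is that a large automatic or private array inside the offloaded region becomes per-thread stack storage on the GPU, and with m large enough its size exceeds the 524288-byte limit regardless of how much device memory the card has:

module demo
contains
  subroutine compute(a, c, n, m)
    integer, intent(in) :: n, m
    real, intent(in)    :: a(n, m)
    real, intent(out)   :: c(n)
    integer :: i, j
    real :: work(m)   ! automatic array: one private copy per thread on the device stack

    ! With private(work) and a large m (e.g. m > 131072 reals = 512 KiB),
    ! nvfortran reports NVFORTRAN-S-1101 at compile time.
    !$omp target teams distribute parallel do private(work) map(to: a) map(from: c)
    do i = 1, n
       do j = 1, m
          work(j) = a(i, j) * 2.0
       end do
       c(i) = sum(work)
    end do
  end subroutine
end module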