Out of memory error on DGX A100 (40 GB)

I am having an issue running a Fortran code on a DGX A100 (40 GB) using SLURM and the HPC SDK module from the NGC catalog. The code runs out of memory.

The code uses OpenMP and OpenACC and involves calculations like FFTs and convolutions. I had no issue running the same code on a DGX V100 (32 GB) with Docker (without SLURM). When I reduce the problem size by changing parameters in the code, it runs without an issue on the A100 too.

I am attaching a screenshot of the error. I also tried setting the environment variable NV_ACC_CUDA_STACKSIZE=64, but that did not help. I would appreciate any input.

Hi ranjith.kunnath,

There’s not enough information here to determine the root cause, but let’s see what clues we can get from what you’ve provided.

From the present table dump, it appears that the program has allocated ~31GB of device memory. Together with the CUDA context of about 400MB, this barely fits on your 32GB V100, and that presumes you’re not using any private arrays, which get allocated when you launch a kernel. I’d be surprised if docker/SLURM has an impact, since these should just be scheduling a node and setting up the environment. Hence it’s unclear why this would work on a V100 and not an A100.
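
As a rough check of that budget, the arithmetic can be sketched in shell. The figures are the approximate ones quoted above (about 31 GB allocated, about 400 MB of context), taken in decimal MB, and the variable names are illustrative only.

```shell
# Rough device-memory budget in MB, using the approximate figures quoted above.
allocated_mb=31000   # ~31 GB reported in the present table dump
context_mb=400       # ~400 MB CUDA context (approximate)
v100_mb=32000        # 32 GB V100
a100_mb=40000        # 40 GB A100

used_mb=$((allocated_mb + context_mb))
v100_headroom_mb=$((v100_mb - used_mb))
a100_headroom_mb=$((a100_mb - used_mb))

echo "used:          ${used_mb} MB"
echo "V100 headroom: ${v100_headroom_mb} MB"
echo "A100 headroom: ${a100_headroom_mb} MB"
```

On these numbers the A100 should have far more headroom than the V100, which is why the failure on the A100 is unexpected and points to something beyond the explicitly allocated data.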

You say you’re using both OpenMP and OpenACC. Do you have OpenACC kernels inside of the OpenMP parallel regions? If so, do you run more OpenMP threads on the A100 system? In other words, could the problem be that you’re running more threads on a single GPU thus using more memory?

Also, do you know where the out-of-memory message is coming from and what it’s trying to allocate? If not, try running under the compute-sanitizer utility or under cuda-gdb. The size matches the size of the “conv” arrays.


Regarding NV_ACC_CUDA_STACKSIZE=64: the unit here is bytes, so 64 is a very small value. The stack mostly matters when calling device subroutines, though, and if it were too small I’d expect an illegal address error, not an out-of-memory error. Hence setting this is unlikely to help.
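
To make the unit concrete, here is a small shell sketch comparing the value that was tried with a larger one. The 64 KiB figure is an arbitrary illustration, not a recommended setting, and as noted above a larger stack is unlikely to fix an out-of-memory error.

```shell
# NV_ACC_CUDA_STACKSIZE is interpreted in bytes (per the reply above).
tried_bytes=64                  # value from the original post: only 64 bytes
example_bytes=$((64 * 1024))    # 64 KiB, an arbitrary illustrative value

echo "tried:   ${tried_bytes} bytes"
echo "example: ${example_bytes} bytes"

# Export the larger value so a subsequent run of the program picks it up.
export NV_ACC_CUDA_STACKSIZE="${example_bytes}"
```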

A reproducing example or more information about the program would help narrow down the issue.


Hi Mat,

Thanks for the response.

I checked, and we are not in fact using OpenMP for this run of the code. The only OpenACC directives used are !$acc function and !$acc kernels. Mostly, we are trying to parallelize some loops using !$acc kernels.

Just an observation: the memory per core on the V100 is 32000/5120=6.25 MB, whereas on the A100 it is 40000/6912=5.78 MB. So there actually seems to be less memory per core on the A100. Can this be the reason for the out-of-memory error? Is there a workaround? Is there a way, for example, to limit the number of threads to 5120 instead of 6912 on the A100?

We have a number of arrays such as conv11, conv12, conv13, conv21, conv22, conv23, conv31, conv32 and conv33. There is no report related to conv33 in the screenshot. conv33 involves a Bessel function calculation. The other conv arrays involve more complicated calculations.

As seen in the screenshot, the arrays such as conv11, conv12,… are of size (1024,512). The intriguing part is that when the arrays are reduced to (512,256), the code runs on A100 as well as on V100. But with (1024,512), it runs only on V100.


Memory per core doesn’t matter. What does matter is whether you’re using more threads. If each thread has private arrays, then the more threads you run, the more memory gets used. You haven’t indicated whether you’re using private variables, though, so it’s unclear if this is the cause. If the out-of-memory error is occurring during a kernel launch, then it’s likely the allocation of the private arrays.
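
The scaling argument can be sketched with shell arithmetic. Everything here is hypothetical: the thread does not say whether the code has private arrays, what element type is used, or how many private copies the runtime creates, so the 8-byte elements and the copy counts below are placeholders chosen only to show the trend.

```shell
# Hypothetical estimate of the memory consumed by private copies of one array.
elems=$((1024 * 512))    # elements per array, using the conv array shape from the thread
bytes_per_elem=8         # ASSUMPTION: double-precision real elements
per_copy_bytes=$((elems * bytes_per_elem))

# Total MiB needed when the runtime creates $1 private copies of such an array.
private_mib() {
    echo $(( $1 * per_copy_bytes / 1024 / 1024 ))
}

# ASSUMPTION: these copy counts are illustrative, not measured on either GPU.
for copies in 1000 2000; do
    echo "${copies} copies -> $(private_mib "${copies}") MiB"
done
```

Doubling the number of copies doubles the footprint, which is how a launch with more threads can exhaust memory even when the explicitly allocated data fits. Shrinking the arrays to (512,256) would cut each copy to a quarter of the size.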

“There is no report related to conv33 in the screenshot.”

Ok, it could be failing when allocating conv33, though you’ll need to do the analysis to determine this or provide a reproducing example so I can investigate.

Besides compute-sanitizer and cuda-gdb, another option is to set the environment variable “NV_ACC_NOTIFY”. Its value is a bit mask that selects which information the OpenACC runtime reports about the run:

Report kernel launches: 1
Report data movement: 2
Report enter/exit of compute regions: 4
Report async/wait: 8
Report about device allocation: 16

So here I’d recommend looking at the kernel launches and device allocations, i.e. set NV_ACC_NOTIFY=17 (1 + 16).
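
A minimal sketch of composing that mask in shell, using the bit values listed above. The variable names and the commented-out launch line are illustrative; substitute however the program is normally started.

```shell
# Bit values for NV_ACC_NOTIFY, as listed above.
notify_launch=1    # kernel launches
notify_data=2      # data movement
notify_region=4    # enter/exit of compute regions
notify_wait=8      # async/wait
notify_alloc=16    # device allocation

# Combine kernel launches and device allocation into one mask.
export NV_ACC_NOTIFY=$((notify_launch | notify_alloc))
echo "NV_ACC_NOTIFY=${NV_ACC_NOTIFY}"

# Then run the program as usual, e.g. (hypothetical executable name):
# srun ./my_app
```

Other bits combine the same way; for example, also including data movement would give 19.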