Analysis of memory usage on GPU

Hi,

I use the environment variable PGI_ACC_NOTIFY set to 2 to get information about data transfers from the host to the GPU. I have a simulation that stops with an “out of memory” error and I would like to get a memory profile to make sure that only data that is really necessary on the device is copied over and not incidentally any other data due to an error in the code.
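To be concrete, the directives in question are of this general form (a trimmed-down C sketch with placeholder names, not my actual code):

    /* Sketch: an explicit copy clause like this is what I expect
       PGI_ACC_NOTIFY=2 to report as a host-to-device transfer. */
    void scale(double *a, int n)
    {
        #pragma acc data copy(a[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                a[i] = 2.0 * a[i];
        }
    }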

First, I noticed that the data of an array seems to be transferred/reported in several chunks (n equally sized chunks plus one chunk of a different size to make up the total).

However, when I add up all the bytes in a spreadsheet, I get a number much lower than the one reported by the out-of-memory error:

Out of memory allocating 186119568 bytes of device memory
total/free CUDA memory: 6039339008/154578944

The total computed by the spreadsheet (I rechecked it several times) is about 2.1 GB, so I am wondering who or what is using up the difference of about 3.x GB.
Does PGI_ACC_NOTIFY 2 not report all data transfers? Or is there some auxiliary data needed/allocated on the GPU alongside my explicit data transfers?

Thanks,
LS

Hi LS,

Are you allocating memory in both CUDA and OpenACC, or using any CUDA libraries such as cuFFT or cuBLAS where you’re having the library manage the data?

We do have an optimization in our runtime where we manage an OpenACC memory pool. Since allocation and deallocation are expensive, we won't actually free memory back to the CUDA driver and instead re-use it for the next allocation. The problem arises when the code mixes CUDA-driver-managed memory with OpenACC data management: the OpenACC memory pool can grow large enough that a CUDA allocation will fail.

You can disable the OpenACC memory management by setting the environment variable “PGI_ACC_MEM_MANAGE=0”. Also, you can force the memory manager to free all unused memory by calling the PGI routine “acc_clear_freelists()”.
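In C that could look roughly like this (just a sketch; the prototype for acc_clear_freelists() is written out here as an assumption, so check the PGI documentation for the exact header to include):

    /* Sketch: release the PGI OpenACC runtime's pooled device memory
       before a large allocation made outside of OpenACC.
       Assumed prototype: */
    extern void acc_clear_freelists(void);

    void release_openacc_pool(void)
    {
        /* ... earlier OpenACC data regions have been entered and exited ... */
        acc_clear_freelists();   /* hand unused pool memory back to the CUDA driver */
        /* ... a subsequent cudaMalloc or CUDA library allocation is now
           less likely to fail because of the pool ... */
    }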


Are you running multiple MPI processes or OpenMP threads?

I did have another user who was getting out-of-memory errors as well. In his case, he had 12 MPI processes all executing OpenACC code. While the system has 6 GPUs, he wasn't setting which process should use which GPU, so the default was used and all of them ended up on device 0. That caused him to run out of GPU memory.
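If that turns out to be your situation, something like this at start-up spreads the ranks across the GPUs (a C sketch using the standard OpenACC API; adjust if your code is Fortran):

    #include <mpi.h>
    #include <openacc.h>

    /* Sketch: map each MPI rank to a GPU so they don't all end up on
       device 0.  Assumes ranks on a node are numbered consecutively;
       a production code might use a node-local communicator instead. */
    void assign_gpu_to_rank(void)
    {
        int rank, ndev;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        ndev = acc_get_num_devices(acc_device_nvidia);
        if (ndev > 0)
            acc_set_device_num(rank % ndev, acc_device_nvidia);
    }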

Does PGI_ACC_NOTIFY 2 not report all data transfers? Or is there some auxiliary data needed/allocated on the GPU alongside my explicit data transfers?

I believe it does.

  • Mat

Hi Mat,

So far I am an OpenACC purist, i.e. I haven't had to use CUDA or any CUDA libraries yet. If I understand you correctly, I should not run into problems related to memory managed by the OpenACC runtime vs. memory managed by the CUDA driver, correct?

So, if I am only using OpenACC, who will notify me when I run out of memory, the OpenACC runtime or the CUDA driver?

In this instance I have a single process and several threads, but all my acc directives are outside of any parallel sections, so OpenACC should behave as it would in a single-threaded environment.

I will check the environment variable and PGI routine you suggested to see whether the picture changes.

Thanks,
LS

So far I am an OpenACC purist, i.e. I haven't had to use CUDA or any CUDA libraries yet. If I understand you correctly, I should not run into problems related to memory managed by the OpenACC runtime vs. memory managed by the CUDA driver, correct?

Correct. You shouldn’t encounter this if using pure OpenACC.

So, if I am only using OpenACC, who will notify me when I run out of memory, the OpenACC runtime or the CUDA driver?

The OpenACC runtime catches the error if its call to cudaMalloc fails. If you call cudaMalloc directly, you, the programmer, need to check the error code to see whether it failed.
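For example, if you ever do call it directly (a CUDA C sketch, not taken from your code):

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Sketch: check the return code of a direct cudaMalloc rather than
       assuming it succeeded. */
    double *alloc_device_buffer(size_t n)
    {
        double *d_buf = NULL;
        cudaError_t err = cudaMalloc((void **)&d_buf, n * sizeof(double));
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc of %zu bytes failed: %s\n",
                    n * sizeof(double), cudaGetErrorString(err));
            return NULL;   /* caller must handle the failure */
        }
        return d_buf;
    }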



I have seen cases where an uninitialized variable was used as the size of an array in a data clause and caused an out-of-memory error, but that showed up in the debug output, so I doubt it is the case here.
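In C that bug pattern looks roughly like this (a contrived sketch, not something from your code):

    /* Sketch: n is never initialized, so the data clause below asks the
       runtime to allocate and copy a garbage-sized block, which can show
       up as an out-of-memory error. */
    void buggy(double *a)
    {
        int n;                           /* oops: never set */
        #pragma acc data copyin(a[0:n])
        {
            /* ... kernels ... */
        }
    }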

Still, you might want to review the output from "PGI_ACC_DEBUG=1", which is more exhaustive than "PGI_ACC_NOTIFY=2".

  • Mat

Hi Mat,

Thanks for the hint about PGI_ACC_DEBUG=1; it indeed reveals all device memory allocations. The reason I didn't see everything with PGI_ACC_NOTIFY=2 is that it only reports data transfers triggered by copy or update. So only device memory allocated as a result of a copy shows up, whereas device memory allocated due to an acc create is not reported.
I guess the goal of PGI_ACC_NOTIFY=2 is to help analyze expensive host-device data transfers, not to monitor device memory allocation. The manual actually makes this clear, but I did not read it carefully enough. Sorry for the confusion.
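In C terms, the difference is roughly this (a minimal sketch, not my actual code):

    /* Sketch: with PGI_ACC_NOTIFY=2 the copyin below is reported because
       it triggers a host-to-device transfer, while the create only
       allocates device memory and is therefore silent.  PGI_ACC_DEBUG=1
       shows the allocations behind both clauses. */
    void sketch(double *a, double *tmp, int n)
    {
        #pragma acc data copyin(a[0:n]) create(tmp[0:n])
        {
            /* ... kernels using a and tmp ... */
        }
    }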

Thanks,
LS