Analysis of memory usage on GPU

Hi,

I use the environment variable PGI_ACC_NOTIFY set to 2 to get information about data transfers from the host to the GPU. I have a simulation that stops with an “out of memory” error and I would like to get a memory profile to make sure that only data that is really necessary on the device is copied over and not incidentally any other data due to an error in the code.
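To be concrete, the directives in question are of this general form (a trimmed-down C sketch with placeholder names, not my actual code):

    /* Sketch: an explicit copy clause like this is what I expect
       PGI_ACC_NOTIFY=2 to report as a host-to-device transfer. */
    void scale(double *a, int n)
    {
        #pragma acc data copy(a[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                a[i] = 2.0 * a[i];
        }
    }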

First, I noticed that the data of an array seems to be transferred/reported in several chunks (n equally sized chunks plus one chunk of a different size to make up the total).

However, when I add up all the bytes in a spreadsheet, I get a number much lower than the one reported by the out-of-memory error:

Out of memory allocating 186119568 bytes of device memory
total/free CUDA memory: 6039339008/154578944

The total computed by the spreadsheet (I rechecked it several times) is about 2.1 GB, so I am wondering who or what is using up the difference of about 3.x GB.
Does PGI_ACC_NOTIFY 2 not report all data transfers? Or is there some auxiliary data needed/allocated on the GPU alongside my explicit data transfers?

Thanks,
LS

Hi LS,

Are you allocating memory in both CUDA and OpenACC, or using any CUDA libraries such as cuFFT or cuBLAS where you’re having the library manage the data?

We do have an optimization in our runtime where we manage an OpenACC memory pool. Since allocation and deallocation are expensive, we won't actually free memory back to the CUDA driver and instead re-use it for the next allocation. The problem arises when the code mixes CUDA-driver-managed memory with OpenACC data management: the OpenACC memory pool can grow large enough that a CUDA allocation will fail.

You can disable the OpenACC memory management by setting the environment variable “PGI_ACC_MEM_MANAGE=0”. Also, you can force the memory manager to free all unused memory by calling the PGI routine “acc_clear_freelists()”.
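In C that could look roughly like this (just a sketch; the prototype for acc_clear_freelists() is written out here as an assumption, so check the PGI documentation for the exact header to include):

    /* Sketch: release the PGI OpenACC runtime's pooled device memory
       before a large allocation made outside of OpenACC.
       Assumed prototype: */
    extern void acc_clear_freelists(void);

    void release_openacc_pool(void)
    {
        /* ... earlier OpenACC data regions have been entered and exited ... */
        acc_clear_freelists();   /* hand unused pool memory back to the CUDA driver */
        /* ... a subsequent cudaMalloc or CUDA library allocation is now
           less likely to fail because of the pool ... */
    }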


Are you running multiple MPI processes or OpenMP threads?

I did have another user who was getting out-of-memory errors as well. In his case, he had 12 MPI processes all executing OpenACC code. While the system has 6 GPUs, he wasn't setting which process should use which GPU, so the default was used and all of them ended up on device 0. That caused him to run out of GPU memory.
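If that turns out to be your situation, something like this at start-up spreads the ranks across the GPUs (a C sketch using the standard OpenACC API; adjust if your code is Fortran):

    #include <mpi.h>
    #include <openacc.h>

    /* Sketch: map each MPI rank to a GPU so they don't all end up on
       device 0.  Assumes ranks on a node are numbered consecutively;
       a production code might use a node-local communicator instead. */
    void assign_gpu_to_rank(void)
    {
        int rank, ndev;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        ndev = acc_get_num_devices(acc_device_nvidia);
        if (ndev > 0)
            acc_set_device_num(rank % ndev, acc_device_nvidia);
    }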

Does PGI_ACC_NOTIFY 2 not report all data transfers? Or is there some auxiliary data needed/allocated on the GPU alongside my explicit data transfers?

I believe it does.

  • Mat

Hi Mat,

So far I am an OpenACC purist, i.e. I haven't had to use CUDA or any CUDA libraries yet. If I understand you correctly, I should not run into problems related to memory managed by the OpenACC runtime vs. memory managed by the CUDA driver, correct?

So, if I am only using OpenACC, who will notify me when I run out of memory, the OpenACC runtime or the CUDA driver?

In this instance I have a single process and several threads, but all my acc directives are outside of any parallel sections, so OpenACC should behave as it would in a single-threaded environment.

I will check the environment variable and PGI routine you suggested to see whether the picture changes.

Thanks,
LS

So far I am an OpenACC purist, i.e. I haven't had to use CUDA or any CUDA libraries yet. If I understand you correctly, I should not run into problems related to memory managed by the OpenACC runtime vs. memory managed by the CUDA driver, correct?

Correct. You shouldn’t encounter this if using pure OpenACC.

So, if I am only using OpenACC, who will notify me when I run out of memory, the OpenACC runtime or the CUDA driver?

The OpenACC runtime catches the error if its call to cudaMalloc fails. If you call cudaMalloc directly, you, the programmer, need to check the error code to see whether it failed.
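For example, if you ever do call it directly (a CUDA C sketch, not taken from your code):

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Sketch: check the return code of a direct cudaMalloc rather than
       assuming it succeeded. */
    double *alloc_device_buffer(size_t n)
    {
        double *d_buf = NULL;
        cudaError_t err = cudaMalloc((void **)&d_buf, n * sizeof(double));
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc of %zu bytes failed: %s\n",
                    n * sizeof(double), cudaGetErrorString(err));
            return NULL;   /* caller must handle the failure */
        }
        return d_buf;
    }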



I have seen cases where an uninitialized variable was used as the size of an array in a data clause and caused an out-of-memory error, but that showed up in the debug output, so I doubt it is the case here.
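In C that bug pattern looks roughly like this (a contrived sketch, not something from your code):

    /* Sketch: n is never initialized, so the data clause below asks the
       runtime to allocate and copy a garbage-sized block, which can show
       up as an out-of-memory error. */
    void buggy(double *a)
    {
        int n;                           /* oops: never set */
        #pragma acc data copyin(a[0:n])
        {
            /* ... kernels ... */
        }
    }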

Still, you might want to review the output from "PGI_ACC_DEBUG=1", which is more exhaustive than "PGI_ACC_NOTIFY=2".

  • Mat

Hi Mat,

Thanks for the hint about PGI_ACC_DEBUG=1; it indeed reveals all device memory allocations. The reason I didn't see everything with PGI_ACC_NOTIFY=2 is that it only reports data transfers triggered by copy or update. So only device memory allocated as a result of a copy shows up, whereas device memory allocated due to an acc create is not reported.
I guess the goal of PGI_ACC_NOTIFY=2 is to help analyze expensive host-device data transfers, not to monitor device memory allocation. The manual actually makes this clear, but I did not read it carefully enough. Sorry for the confusion.
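In C terms, the difference is roughly this (a minimal sketch, not my actual code):

    /* Sketch: with PGI_ACC_NOTIFY=2 the copyin below is reported because
       it triggers a host-to-device transfer, while the create only
       allocates device memory and is therefore silent.  PGI_ACC_DEBUG=1
       shows the allocations behind both clauses. */
    void sketch(double *a, double *tmp, int n)
    {
        #pragma acc data copyin(a[0:n]) create(tmp[0:n])
        {
            /* ... kernels using a and tmp ... */
        }
    }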

Thanks,
LS