16.1 run-time: Out of memory allocating x bytes of device memory

Hi there,

My code runs fine with version 15.10.
After updating to version 16.1, my code frequently stops with a message like:

Out of memory allocating 13421760 bytes of device memory
total/free CUDA memory: 2147155968/13303808
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 2.1
host:0xfed2360 device:0x2031a0000 size:13421760 presentcount:1+0 line:164 name:rublten_edge
host:0x7f6be80bfe80 device:0x202020000 size:13421760 presentcount:1+0 line:164 name:mass_edge
host:0x7ffcc849e4b0 device:0x201f20a00 size:96 presentcount:1+0 line:164 name:descriptor
host:0x7ffcc849ee50 device:0x201f20000 size:96 presentcount:1+0 line:164 name:descriptor
call to cuMemAlloc returned error 2: Out of memory


At first I thought this was a problem with GPU memory not being cleared after a previous failure. Then I thought it could be related to GPU and CPU binding, as I have 6 GPU cards and am running 12 MPI tasks.

Appreciate your help,

Thanks,

Wei

Hi Wei,

Do you know how much GPU memory each MPI process should use?
Can you reduce the problem size to see if using less memory works?
How are you binding MPI processes to the GPUs?
Does the error still occur after a system reboot?
Are there other processes running on the GPUs at the same time?
How do you allocate device memory? Using OpenACC data regions, CUDA C, CUDA Fortran, or a combination?

Since it works with 15.10, it could be a compiler problem as well. If you can send us a reproducing example, that would be great and I can investigate. If not, please send me (trs@pgroup.com) the output logs from setting the environment variable "PGI_ACC_DEBUG" for both the 15.10 and 16.1 built binaries. I may be able to tell what's wrong from them, but at the very least they should tell us whether there is a problem.

  • Mat

Hello, Mat,

Here is the source code:
!$acc data copyin(rublten_Edge(1:nVertLevels, 1:nEdgesSolve)), &
!$acc      copyin(mass_edge(1:nVertLevels, 1:nEdgesSolve)), &
!$acc      copy(tend_u(1:nVertLevels, 1:nEdgesSolve))
!$acc kernels loop gang vector collapse(2) independent
do i = 1, nEdgesSolve
   do k = 1, nVertLevels
      tend_u(k,i) = tend_u(k,i) + rublten_Edge(k,i)*mass_edge(k,i)
   enddo
enddo
!$acc end kernels
!$acc end data


So for this loop, the parameters are:

nCells = 163842
nEdges = 491520
nVertices = 327680
nVertLevelsP1 = 42
nVertLevels = 41

Where nEdgesSolve is less than nEdges.

So the memory needed is less than 491520 * 42 * 8 * 3 bytes, i.e. about 473 MB.

There is a way to reduce the problem size (but it will take quite some time).

I did not bind MPI processes to GPUs, and I'd like to know how (I have googled, but have not found anything useful yet).

I have not tried a system reboot, as I am not the sys-admin and quite a few people are using this system (for different projects).

I do not think other processes are using the GPUs, as there are no visualization jobs going on and I am the only one using the GPUs (for computing).

As you can see above, I use OpenACC.

It is hard to come up with a simple reproducible example, but if you are willing to check the whole code, I have no problem sharing the whole code (and data) with you.

Thanks,

Wei

Quote from Wei:

But if you are willing to check the whole code, I have no problem sharing the whole code (and data) with you.

That works for me. Please either send the code or instructions on how to get the code to trs@pgroup.com and ask them to forward it to me.

Also, please include instructions on how to build and run the code, as well as any needed data sets.

Thanks,
Mat

Mat,

I will send the source code and data set as instructed.

Thanks,

Wei

Quote from Mat:

My best guess as to the problem is that all 12 MPI processes are using the same GPU. If you don’t explicitly set the device, the default device (typically device 0) is used.

I have some boilerplate code that you're welcome to use that binds MPI processes to GPUs. See the code in the section titled "Step 1" of my article "5x in 5 hours": http://www.pgroup.com/lit/articles/insider/v4n1a3.htm

I checked Mat's article and used his code to bind the MPI processes to the GPUs, which solved my problem. It also confirmed Mat's guess: before the change, my code had all MPI processes using device 0.
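
For anyone who finds this thread later, this kind of rank-to-GPU binding boils down to something like the sketch below. It is a generic illustration rather than the exact code from Mat's article; it assumes the standard mpi and openacc modules and simply spreads ranks round-robin over the visible devices.

subroutine bind_rank_to_gpu()
   use mpi
   use openacc
   implicit none
   integer :: rank, ierr, ndevices, mydevice

   ! Rank within MPI_COMM_WORLD (a node-local communicator would be more
   ! robust across multiple nodes, but this works when the ranks fill one node).
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

   ! Number of NVIDIA devices visible to this process.
   ndevices = acc_get_num_devices(acc_device_nvidia)

   if (ndevices > 0) then
      ! Assign ranks round-robin instead of letting every rank default to device 0.
      mydevice = mod(rank, ndevices)
      call acc_set_device_num(mydevice, acc_device_nvidia)
      print *, 'rank', rank, 'using device', mydevice, 'of', ndevices
   end if
end subroutine bind_rank_to_gpu

Calling something like this right after MPI_Init and before the first OpenACC construct makes every later data region and kernel run on the rank's own device; with 12 ranks and 6 GPUs, mod(rank, ndevices) puts two ranks on each card.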

Thank you Mat!

Wei