cudaFree isn't cleaning global memory

Thorn_Striff · June 9, 2010, 10:01pm

Hello folks,

I wrote code that allocates 6 large vectors (> 100MB each) using cudaMalloc. After my kernels have used all the vectors, I release each one with cudaFree.
What happens is that after a few executions (my test device has the global memory needed) it starts returning an “Out of Memory” error when I try to allocate anything again.

It looks like something isn’t being freed when I call “cudaFree”, or something like that. Can someone suggest a test or debugging method to understand why I’m receiving this “Out of memory” error after a few executions?

SPWorley · June 9, 2010, 10:09pm

Likely what you’re hitting is memory address space fragmentation. The memory really is freed and available, but the repeated allocs and frees scatter your existing allocs all through the address space. The free memory addresses are also scattered in chunks… there may be 2GB free, but that’s summed over dozens of smaller regions of, say, 100MB each.

So when you try to alloc 200MB, there’s 2GB free, but there’s no single 200MB chunk to alloc and the memory alloc fails.
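To make the arithmetic concrete, here is a toy model of a fragmented address space. It is only an illustration: `FreeList` and `make_fragmented_space` are hypothetical names, sizes are in MB, and nothing here reflects how the CUDA driver actually tracks memory.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>

// Toy model: free regions of an address space, keyed by start offset (in MB).
// Hypothetical illustration only; not the CUDA driver's bookkeeping.
struct FreeList {
    std::map<std::size_t, std::size_t> regions;  // start -> length

    // Sum of all free regions: the "free memory" figure.
    std::size_t total_free() const {
        std::size_t sum = 0;
        for (const auto& r : regions) sum += r.second;
        return sum;
    }

    // Size of the biggest single contiguous free region.
    std::size_t largest_free() const {
        std::size_t best = 0;
        for (const auto& r : regions) best = std::max(best, r.second);
        return best;
    }

    // A contiguous allocation only succeeds if one region is big enough.
    bool can_allocate(std::size_t size) const { return largest_free() >= size; }
};

// A 4000 MB space in which every other 100 MB block is still in use,
// leaving twenty separate free regions of 100 MB each.
FreeList make_fragmented_space() {
    FreeList fl;
    for (std::size_t start = 0; start < 4000; start += 200) fl.regions[start] = 100;
    return fl;
}
```

In this model 2000 MB is free in total, yet a 200 MB request fails because no single free region is larger than 100 MB.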

The workaround is to keep each individual allocation small… better to allocate many chunks of 50MB each rather than one chunk of 250MB.

This is an identical problem on the CPU as well, one I’ve often hit in 32 bit Windows (with only 2GB of address space). 64 bit makes this disappear on the CPU.

I suspect that Fermi, running on a 64 bit OS, would not have this issue.

TL;DR: you’re fragmenting your freed memory. Don’t alloc/free so often, and if you do, use many small chunks and not one big chunk.

fna · June 9, 2010, 10:22pm

Pinned memory could possibly help you get more space.

From the programming guide (3.1 beta section 3.2.5)

I haven’t used it before, but it is something to check out.

Thorn_Striff · June 9, 2010, 11:31pm

I already plan to implement a new algorithm where I allocate smaller vectors (n vectors of 200/n MB each instead of one big 200MB vector), but it will take some time to finish the algorithm.
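A minimal host-side sketch of that chunked layout, assuming equally sized chunks and a `chunk_size` greater than zero. `ChunkedLayout` and `ChunkedVector` are hypothetical names; here `std::vector` stands in for the per-chunk buffers, whereas on the device each chunk would presumably be its own cudaMalloc allocation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: maps a logical index onto (chunk, offset) for a vector
// that is stored as several smaller chunks instead of one contiguous block.
struct ChunkedLayout {
    std::size_t size;        // total number of elements
    std::size_t chunk_size;  // elements per chunk (assumed > 0)

    std::size_t num_chunks() const { return (size + chunk_size - 1) / chunk_size; }
    std::size_t chunk_of(std::size_t i) const { return i / chunk_size; }
    std::size_t offset_of(std::size_t i) const { return i % chunk_size; }

    // The last chunk may be shorter than the others.
    std::size_t chunk_length(std::size_t c) const {
        std::size_t begin = c * chunk_size;
        return (begin + chunk_size <= size) ? chunk_size : size - begin;
    }
};

// Host-side stand-in: one std::vector per chunk.
class ChunkedVector {
public:
    ChunkedVector(std::size_t size, std::size_t chunk_size) : layout_{size, chunk_size} {
        for (std::size_t c = 0; c < layout_.num_chunks(); ++c)
            chunks_.emplace_back(layout_.chunk_length(c), 0.0f);
    }

    // Access element i as if the storage were contiguous.
    float& at(std::size_t i) {
        return chunks_[layout_.chunk_of(i)][layout_.offset_of(i)];
    }

    const ChunkedLayout& layout() const { return layout_; }

private:
    ChunkedLayout layout_;
    std::vector<std::vector<float>> chunks_;
};
```

The cost of this design is the extra index arithmetic on every access, or alternatively one kernel launch per chunk, in exchange for never needing a single large contiguous region.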

I will try a 64-bit OS and see if it gets better.

Wow! This is a really good function. I don’t know if it will help me with this damn address space fragmentation error, but it will help a lot when dealing with a device that has little memory. Thanks.

tera · June 9, 2010, 11:37pm

Note that the address space fragmentation happens in the device address space, so a 64 bit host OS is not going to help.

SPWorley · June 10, 2010, 12:09am

Fermi has 64 bit addressing, but only on a 64 bit host OS.

eelsen · June 10, 2010, 12:24am

You could also just allocate one big chunk up front and handle the memory management yourself. Or just don’t free the vectors - allocate large enough that they won’t need to be resized and then leave them alone?

These techniques will only help if you don’t need to free the memory to make room for something else.

SPWorley · June 10, 2010, 3:09am

This can be a very effective strategy… I’ve used it several times for 32 bit CPU coding.

The key is not to try anything fancy: don’t implement your own malloc() or anything, just use your knowledge of your application and how it needs to use memory. It’s likely you first do a wave of allocs, perhaps many and very big, do lots of work, then free them all before a new wave of mallocs. In that case, you can do one big initial malloc and just set your own pointers inside of it for each “wave.” You keep an offset counter and increment it for each of your allocs until you’re done, and when you need to free them all, you reset the counter back to the start of your big chunk.

Of course there are complexities, like what if you can’t get a single big chunk to begin with, but the same strategy can be divided hierarchically if necessary. The fancier you get the more troublesome it will be though, so keep it simple if you can.
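The counter-and-reset scheme described above can be sketched roughly as follows. This is one possible reading of it, not code from the thread: `WaveArena` is a hypothetical name, the 256-byte default alignment is an assumption, and the arena only does pointer arithmetic on `base`, so the block could equally be a device pointer obtained from a single cudaMalloc.

```cpp
#include <cstddef>

// Hypothetical "wave" (bump) allocator over one big block obtained up front.
// It never dereferences `base`, so `base` may be a device pointer.
class WaveArena {
public:
    // `base` is assumed to be aligned at least as strictly as any `align`
    // later passed to alloc().
    WaveArena(void* base, std::size_t capacity)
        : base_(static_cast<char*>(base)), capacity_(capacity), used_(0) {}

    // Hand out `bytes` from the block, starting at the next offset that is a
    // multiple of `align` (assumed to be a power of two).
    // Returns nullptr if the block does not have enough room left.
    void* alloc(std::size_t bytes, std::size_t align = 256) {
        std::size_t start = (used_ + align - 1) & ~(align - 1);
        if (start > capacity_ || bytes > capacity_ - start) return nullptr;
        used_ = start + bytes;
        return base_ + start;
    }

    // "Free" every allocation of the current wave by resetting the counter.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    char* base_;
    std::size_t capacity_;
    std::size_t used_;
};
```

In a CUDA program the intended usage would be one cudaMalloc for the whole block at startup, `alloc()` calls in place of the per-vector cudaMalloc calls, `reset()` between waves, and a single cudaFree at exit. Since the big block is never returned to the driver in between, the device address space has no opportunity to fragment.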

tera · June 10, 2010, 11:46am

So does Fermi have a (P)MMU then, so that it could take advantage? I’ve occasionally seen people hinting at this in the forum, but not seen any evidence.

wumpus · June 28, 2010, 1:46pm

Even Tesla has a 64 bit MMU :) Just no support for 64 bit pointers in device code, Fermi added that.

tera · June 29, 2010, 12:26am

Interesting. Do you have a reference for that?
What do you use a 64 bit MMU for if you only have 32 bit pointers and only one concurrent context per device?

wwa · June 29, 2010, 1:49am

Well, one “use” I could think of would be: “So you don’t have to design it twice”.
