I wrote some code that allocates 6 large (> 100 MB each) vectors using cudaMalloc. After I use all the vectors in my kernels, I clean everything up with cudaFree for each one of them.
The problem is, after a few executions (my test device has the global memory needed), it starts returning an "Out of Memory" error when I try to allocate anything again.
It looks like something isn't being freed when I call cudaFree, or something like that. Can someone suggest a test or debug method to understand why I'm getting this "Out of Memory" error after a few executions?
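One simple check (a sketch, not a full tool — the helper name is mine): bracket your allocs and frees with cudaMemGetInfo and print the free byte count. If the "after frees" number returns to roughly the "before allocs" number each run, the memory really is being released and the problem is elsewhere (e.g. fragmentation, as discussed below).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print free/total device memory with a label.
static void reportMem(const char *label)
{
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    printf("%s: %zu MB free of %zu MB\n", label, freeB >> 20, totalB >> 20);
}

int main()
{
    const int    N     = 6;
    const size_t bytes = 100u << 20;   // ~100 MB per vector, as in the question
    float *vec[N];

    reportMem("before allocs");
    for (int i = 0; i < N; ++i) {
        cudaError_t err = cudaMalloc((void **)&vec[i], bytes);
        if (err != cudaSuccess)
            printf("alloc %d failed: %s\n", i, cudaGetErrorString(err));
    }
    reportMem("after allocs");

    for (int i = 0; i < N; ++i)
        cudaFree(vec[i]);
    reportMem("after frees");   // should be back near the starting value
    return 0;
}
```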
Likely what you’re hitting is address space fragmentation. The memory really is freed and available, but the repeated allocs and frees scatter your live allocations all through the address space. The free addresses are likewise scattered in chunks… there may be 2 GB free, but that’s summed over dozens of smaller, say 100 MB, regions.
So when you try to alloc 200 MB, there’s 2 GB free in total, but no single contiguous 200 MB chunk, and the allocation fails.
The workaround is to minimize your allocation sizes… better to allocate five chunks of 50 MB each rather than one chunk of 250 MB, since a small free hole is much easier to find than a big one.
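As a sketch of that workaround (sizes and names are illustrative, not from the original post), you can request one logical array as several pieces and have your kernels index into the piece that holds the element they need:

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Instead of one 250 MB cudaMalloc, request five 50 MB pieces.
// A 50 MB free hole is far more likely to exist in a fragmented
// address space than a single 250 MB one.
int allocPieces(float *piece[], int pieces, size_t pieceBytes)
{
    for (int i = 0; i < pieces; ++i) {
        if (cudaMalloc((void **)&piece[i], pieceBytes) != cudaSuccess) {
            // back out cleanly on failure
            for (int j = 0; j < i; ++j)
                cudaFree(piece[j]);
            return 0;
        }
    }
    return 1;
}
```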
This is an identical problem on the CPU as well, one I’ve often hit on 32-bit Windows (with only 2 GB of address space). 64-bit makes this disappear on the CPU.
I suspect that Fermi, running on a 64 bit OS, would not have this issue.
TL;DR: you’re fragmenting your freed memory. Don’t alloc/free so often, and if you do, use many small chunks and not one big chunk.
You could also just allocate one big chunk up front and handle the memory management yourself. Or just don’t free the vectors - allocate large enough that they won’t need to be resized and then leave them alone?
These techniques will only help if you don’t need to free the memory to make room for something else.
This can be a very effective strategy… I’ve used it several times for 32 bit CPU coding.
The key is not to try anything fancy; don’t implement your own malloc() or anything, just use your knowledge of your application and how it needs to use memory. It’s likely you first do a wave of allocs, perhaps many and very big, do lots of work, then free them all before a new wave of allocs. In that case, you can see how you could use one big initial cudaMalloc and just set your own pointers inside of it each “wave.” You keep an offset counter and advance it for each of your allocs until you’re done, and when you need to free them all, you just reset the counter back to the start of your big chunk.
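The wave strategy above can be sketched as a simple bump allocator over one big chunk (names and the 256-byte alignment choice are mine; CUDA pointers generally need alignment suitable for your widest data type):

```cuda
#include <cuda_runtime.h>

static char  *g_pool   = NULL;   // the one big cudaMalloc'd block
static size_t g_size   = 0;
static size_t g_offset = 0;      // the "counter"

// Grab the big chunk once, up front.
int poolInit(size_t bytes)
{
    g_size   = bytes;
    g_offset = 0;
    return cudaMalloc((void **)&g_pool, bytes) == cudaSuccess;
}

// "Alloc" by handing out the next slice of the pool.
void *poolAlloc(size_t bytes)
{
    bytes = (bytes + 255) & ~(size_t)255;   // round up to 256-byte alignment
    if (g_offset + bytes > g_size)
        return NULL;                        // pool exhausted
    void *p = g_pool + g_offset;
    g_offset += bytes;
    return p;
}

// "Free" the whole wave at once by resetting the counter.
void poolReset(void)   { g_offset = 0; }

void poolDestroy(void) { cudaFree(g_pool); g_pool = NULL; }
```

Since the pool itself is never freed between waves, the driver's address space never fragments, no matter how many waves you run.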
Of course there are complexities, like what if you can’t get a single big chunk to begin with, but the same strategy can be applied hierarchically if necessary. The fancier you get, the more troublesome it will be though, so keep it simple if you can.