Hello,
I was testing some code that uses the cudaMallocManaged function to allocate data structures that should be accessible from both device and host. I noticed that the function actually consumes almost 100 times the requested amount of memory. I thought this happened because the cudaMalloc functions return 512-byte-aligned memory blocks on my GPU, so I ran some experiments. It turns out that while cudaMalloc consumes roughly the amount of memory requested, cudaMallocManaged simply doesn't, consuming far more bytes. I attach my testing code:
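For reference, a minimal sketch of such a test, using cudaMemGetInfo() to measure how much device memory each call consumes (the exact numbers depend on GPU, platform and driver; on some systems managed pages only occupy GPU memory once they are touched):

```cpp
// Sketch: compare the device memory consumed by a single small cudaMalloc
// versus a single small cudaMallocManaged, as seen by cudaMemGetInfo().
#include <cstdio>
#include <cuda_runtime.h>

static size_t freeDeviceBytes() {
    size_t freeB = 0, totalB = 0;
    cudaMemGetInfo(&freeB, &totalB);
    return freeB;
}

int main() {
    const size_t request = 512;  // bytes requested per allocation

    size_t before = freeDeviceBytes();
    void* d = nullptr;
    cudaMalloc(&d, request);
    printf("cudaMalloc(512)        consumed %zu bytes\n", before - freeDeviceBytes());
    cudaFree(d);

    before = freeDeviceBytes();
    void* m = nullptr;
    cudaMallocManaged(&m, request);
    printf("cudaMallocManaged(512) consumed %zu bytes\n", before - freeDeviceBytes());
    cudaFree(m);

    return 0;
}
```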
I would think that cudaMallocManaged() is allocating memory pages corresponding to your operating system’s page size. This is because pages of managed memory need to be swapped between CPU and GPU on demand.
If you need to make more fine-grained allocations, consider putting your own memory pooling logic on top of memory returned by cudaMallocManaged().
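A minimal sketch of such a pooling layer, assuming one large managed block carved up by a simple bump-pointer allocator (the ManagedPool name, the 256-byte default alignment, and the absence of any free/reuse logic are illustrative choices, not a recommendation):

```cpp
// Sketch: hand out small, aligned pieces of one big cudaMallocManaged() block.
#include <cstddef>
#include <cuda_runtime.h>

struct ManagedPool {
    char*  base   = nullptr;
    size_t size   = 0;
    size_t offset = 0;

    // Reserve one large managed block up front.
    bool init(size_t bytes) {
        size = bytes;
        return cudaMallocManaged((void**)&base, bytes) == cudaSuccess;
    }

    // Bump-pointer allocation; align must be a power of two.
    // Returns nullptr once the pool is exhausted.
    void* alloc(size_t bytes, size_t align = 256) {
        size_t start = (offset + align - 1) & ~(align - 1);
        if (start + bytes > size) return nullptr;
        offset = start + bytes;
        return base + start;
    }

    void destroy() { cudaFree(base); base = nullptr; offset = 0; }
};

// Usage: ManagedPool pool; pool.init(64u << 20);
//        float* v = (float*)pool.alloc(1000 * sizeof(float));
```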
There is definitely an allocation granularity, and it may be larger than the size of the allocations you are requesting. This would inflate the amount of memory actually used.
So I should write my own memory manager in order to allocate larger blocks of data and redistribute them?
For example, if I request 512MB of memory in small portions of 512 bytes each, CUDA will almost saturate my VRAM. This can be tested with this code:
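A minimal sketch of that experiment: request roughly 512MB as one million 512-byte managed allocations and compare cudaMemGetInfo() before and after (whether the drop in free memory shows up immediately depends on platform and driver):

```cpp
// Sketch: many tiny managed allocations vs. how much GPU memory disappears.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    const size_t chunk = 512;                    // bytes per allocation
    const size_t count = (512u << 20) / chunk;   // ~512MB requested in total

    size_t freeBefore = 0, freeAfter = 0, total = 0;
    cudaMemGetInfo(&freeBefore, &total);

    std::vector<void*> blocks;
    blocks.reserve(count);
    for (size_t i = 0; i < count; ++i) {
        void* p = nullptr;
        if (cudaMallocManaged(&p, chunk) != cudaSuccess) {
            printf("allocation failed after %zu blocks\n", i);
            break;
        }
        blocks.push_back(p);
    }

    cudaMemGetInfo(&freeAfter, &total);
    printf("requested ~512MB, GPU free memory dropped by %zu MB\n",
           (freeBefore - freeAfter) >> 20);

    for (void* p : blocks) cudaFree(p);
    return 0;
}
```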
If you allocate 512MB in one block, that will require about 512MB of your GPU memory plus some fixed CUDA overhead.
For example, after the first 512MB allocation you may see about 700MB used on the GPU; the difference between 512MB and 700MB is CUDA overhead, and it should not increase (much) as you allocate more memory. If you then allocate another 512MB, you should see about 1200MB used on the GPU.
Only you can decide if it is worth it to use a sub-allocator.
You’re operating in a virtual memory space where addresses may or may not have physical memory pages (on either CPU or GPU or both) behind them.
If you plan to do memory pooling in CUDA, google for existing solutions first. There may be quite a few libraries available already - optimized for different use cases.
The method actually matches the VS profiler data, so I assume that CUDA allocates memory sequentially. It is fun to notice that there is no garbage collection mechanism behind the allocator and/or the compiler (but maybe that is due to the optimization level being set to debug).
I tried to search for CUDA-optimized suballocators, but they are not suitable for my purposes. Wish me good luck, then…