I noticed that my application has significant overhead from the allocation/de-allocation routines (cudaMallocPitch, cudaFree). I need a lot of temporary images (pitch-linear memory), etc…
I suppose this will only get worse in the future, because the execution time of my kernels will go down (faster GPUs) while the time for allocation/de-allocation stays constant.
I am wondering if there is some nice open-source custom memory allocator for CUDA, holding a memory pool or something like that. Ideally one geared towards allocation/de-allocation of images (which can be tens of megabytes in size).
I know there is a custom memory allocator in the CUB library (https://github.com/NVlabs/cub).
Is there some other useful allocator available for CUDA? It could also be an allocator for CPU memory, provided it could easily be modified (replacing the CPU allocation/free routines with GPU allocation/free routines).
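To illustrate what I mean, here is a minimal sketch of the kind of caching pool I have in mind. This is hypothetical code of my own, using plain malloc/free as stand-ins for cudaMallocPitch/cudaFree: freed blocks are kept in per-size free lists and handed back out on the next request, so the expensive underlying allocator is hit far less often.

```cpp
#include <cstdlib>
#include <map>
#include <vector>

// Sketch of a caching pool: freed blocks are cached in per-size free
// lists and reused, so the underlying allocator (malloc/free here,
// cudaMallocPitch/cudaFree in the real thing) is called far less often.
class CachingPool {
public:
    void* allocate(std::size_t bytes) {
        auto& list = free_lists_[bytes];
        if (!list.empty()) {               // reuse a cached block
            void* p = list.back();
            list.pop_back();
            return p;
        }
        return std::malloc(bytes);         // fall back to the real allocator
    }

    void deallocate(void* p, std::size_t bytes) {
        free_lists_[bytes].push_back(p);   // cache instead of freeing
    }

    ~CachingPool() {                       // release everything at shutdown
        for (auto& kv : free_lists_)
            for (void* p : kv.second)
                std::free(p);
    }

private:
    std::map<std::size_t, std::vector<void*>> free_lists_;
};
```

For images I would presumably key the cache on (width, height, pitch) rather than a raw byte size, and a real GPU version would also need a policy for trimming the cache when device memory runs low, which is why I am hoping an existing library already solves this.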