Custom memory allocator for CUDA desired

I noticed that my application has significant overhead from the allocation/de-allocation routines (cudaMallocPitch, cudaFree), because I need a lot of temporary images (pitch-linear memory), etc.

This will presumably get worse in the future, because the execution time of my kernels will go down (faster GPUs) while the time for allocation/de-allocation stays constant.

I am wondering if there is a nice open-source custom memory allocator for CUDA, holding a memory pool or something similar. Ideally it would be geared towards allocation/de-allocation of images (which can be tens of megabytes in size).

I know there is a custom memory allocator in the ‘CUB’ library (github.com/nvidia/cub).
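As far as I can tell, using CUB’s caching allocator looks roughly like the sketch below (it caches linear cudaMalloc memory in size bins, so for a pitched image one would have to compute the pitch manually):

```cpp
#include <cub/util_allocator.cuh>

// One caching allocator; freed blocks are kept in size bins and handed
// out again instead of calling cudaMalloc/cudaFree every time.
cub::CachingDeviceAllocator g_allocator;

void process(size_t bytes, cudaStream_t stream)
{
    void *d_tmp = nullptr;
    g_allocator.DeviceAllocate(&d_tmp, bytes, stream); // cached cudaMalloc
    // ... launch kernels that use d_tmp on 'stream' ...
    g_allocator.DeviceFree(d_tmp);                     // returns block to the pool
}
```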

Is there some other useful allocator available for CUDA? It could also be an allocator for CPU memory, if it can easily be modified (replacing the CPU allocation/free routines with GPU allocation/free routines).

I assume you have already examined the possibility of re-using existing allocations to avoid malloc/free cycles? A faster CPU might also help, since allocations involve mostly administrative overhead on the host. I have never measured the speed of CUDA allocations relative to CPU speed, though, so I am not sure how much impact that has. My general advice is to pair fast GPUs with CPUs that have the highest single-thread performance, to avoid becoming bottlenecked on serial tasks (at the moment this means CPUs with >= 3.5 GHz).
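For a single recurring temporary image, the simplest form of re-use is a grow-only scratch buffer that only calls cudaMallocPitch when the requested size exceeds the current capacity. A minimal sketch (the ScratchImage name is made up for illustration):

```cpp
#include <cuda_runtime.h>

struct ScratchImage {
    void  *ptr    = nullptr;
    size_t pitch  = 0;
    size_t width  = 0;   // allocated row width in bytes
    size_t height = 0;   // allocated number of rows

    // Reuses the existing allocation if it is big enough,
    // otherwise frees it and allocates a larger one.
    cudaError_t acquire(size_t widthBytes, size_t h)
    {
        if (ptr && widthBytes <= width && h <= height)
            return cudaSuccess;                  // reuse, no API call
        if (ptr) cudaFree(ptr);
        width = widthBytes; height = h;
        return cudaMallocPitch(&ptr, &pitch, widthBytes, h);
    }

    ~ScratchImage() { if (ptr) cudaFree(ptr); }
};
```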

Custom memory allocators are exactly that: customized for each use case. This makes it unlikely that there is code out there that does exactly the right thing for your application. I have in the past written simple sub-allocators or memory-pool implementations on CPUs in about one work day, so “rolling your own” seems like a realistic option.
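As a rough illustration of what such a sub-allocator could look like for the image case, here is a toy pool that caches freed pitched blocks per (width, height) and serves repeat requests without touching the CUDA API. A sketch only, not production code; in particular it is not thread-safe:

```cpp
#include <cuda_runtime.h>
#include <map>
#include <vector>

class PitchedPool {
    struct Block { void *ptr; size_t pitch; };
    // (widthBytes, height) -> cached free blocks of exactly that size
    std::map<std::pair<size_t, size_t>, std::vector<Block>> free_;
public:
    cudaError_t alloc(void **ptr, size_t *pitch, size_t widthBytes, size_t height)
    {
        auto it = free_.find({widthBytes, height});
        if (it != free_.end() && !it->second.empty()) {
            Block b = it->second.back();         // serve from the pool
            it->second.pop_back();
            *ptr = b.ptr; *pitch = b.pitch;
            return cudaSuccess;
        }
        return cudaMallocPitch(ptr, pitch, widthBytes, height);
    }

    void release(void *ptr, size_t pitch, size_t widthBytes, size_t height)
    {
        // Cache instead of cudaFree; call trim() to really release memory.
        free_[{widthBytes, height}].push_back({ptr, pitch});
    }

    void trim()
    {
        for (auto &kv : free_)
            for (Block &b : kv.second) cudaFree(b.ptr);
        free_.clear();
    }
};
```

Because image sizes in a processing pipeline tend to repeat from frame to frame, even this exact-size matching already eliminates most malloc/free cycles after warm-up.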

Update: Section 4.3 of the Baidu paper (http://arxiv.org/pdf/1512.02595v1.pdf) gives hints on this topic (when dealing with ‘large’ GPU memory allocations).
A simple (CPU) implementation of the ‘buddy’ memory allocator can be found at github.com/cloudwu/buddy (or in answers to a “Buddy Memory Allocation” question), and related slides by an NVIDIA engineer are at http://iwcse.phys.ntu.edu.tw/parallel/Oct17/Jon-Yu_Lee_131017.pptx
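If I understand the buddy scheme correctly, its key trick is that block sizes are powers of two, so a block’s buddy (its coalescing partner) is found by XOR-ing the block’s offset with its size, making merges on free O(1). A small illustration (these helpers are mine, not from the linked repository):

```cpp
#include <cstddef>

// Smallest power of two >= n; requests are rounded up to a block size.
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Offset of the sibling block at the same level; if the buddy is also
// free, the two can be merged into one block of twice the size.
static size_t buddy_of(size_t offset, size_t block_size)
{
    return offset ^ block_size;
}
```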
And there is a nice paper from HPG14 (http://www.fi.muni.cz/~xvinkl/articles/hpg2014.pdf); slides are at http://www.highperformancegraphics.org/2014/wp-content/uploads/sites/3/2014/07/Vinkler-Allocator.pdf and the code (BSD license) of their ‘CMalloc’ allocator can be found at http://decibel.fi.muni.cz/~xvinkl/CMalloc/
ScatterAlloc (github.com/ComputationalRadiationPhysics/scatteralloc) and the newer mallocMC (github.com/alpaka-group/mallocMC) are open source under an MIT license, but seem to be geared towards small and repetitive allocations (according to the HPG14 paper).
I think I will use the ‘CMalloc’ allocator from the HPG14 paper.