cuMemAlloc speed

I’m observing a significant (10x) difference in cuMemAlloc/cuMemFree speed depending on whether cuBLAS has been initialized, and also when I make a small change to the size argument.

I asked about this on Stack Overflow a couple of days ago:

Details and code in the link.

I would very much appreciate any help on this matter.

Thanks,
Gabor

It seems quite possible that what you are observing is simply an artifact of the interaction between layered memory allocators. Memory allocations are typically handled by layers of allocators and sub-allocators: most allocations can be satisfied by the fastest, top-level allocator, but occasionally a new allocation requires falling back to the next-lower allocator. When this happens depends on the malloc / free pattern as well as the size of the individual allocations.

This is quite the same situation one finds in host code, where a C runtime-library malloc() satisfies most requests from its local storage pool, until that pool runs low and it needs to go to the slower OS allocator to get a fresh chunk of memory.

A 10x performance difference between the fast top-level allocator and the allocator one level lower seems very much within expectations.

Applications requiring predictable timing for all allocations will often allocate memory for a memory pool at the start of the application and then manage that memory pool themselves.
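To make the pool idea concrete, here is a minimal sketch of a first-fit sub-allocator, under some assumptions of mine: the class name OffsetPool, its methods, and the 256-byte default alignment are all hypothetical and chosen only for illustration. It hands out offsets into a single block that the application would allocate once at startup (for example with one cuMemAlloc call, adding each offset to the returned base pointer). It works purely on offsets, so it runs on the host without touching the GPU, and it is not thread-safe.

```cpp
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical first-fit sub-allocator over one pre-allocated block.
// It tracks offsets only; the caller adds them to the block's base address.
class OffsetPool {
public:
    // capacity: size of the pre-allocated block in bytes.
    // alignment: every returned offset is a multiple of this (assumed value).
    explicit OffsetPool(std::size_t capacity, std::size_t alignment = 256)
        : alignment_(alignment) {
        if (capacity > 0) free_[0] = capacity;
    }

    // Finds the first free range that is large enough. Returns false if
    // the request is empty or no free range can hold it.
    bool allocate(std::size_t bytes, std::size_t& offset_out) {
        if (bytes == 0) return false;
        const std::size_t need = round_up(bytes);
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second < need) continue;
            const std::size_t offset = it->first;
            const std::size_t remaining = it->second - need;
            free_.erase(it);
            if (remaining > 0) free_[offset + need] = remaining;
            used_[offset] = need;
            offset_out = offset;
            return true;
        }
        return false;
    }

    // Returns a range to the pool and merges it with adjacent free ranges.
    // Returns false if the offset was not handed out by allocate().
    bool release(std::size_t offset) {
        auto u = used_.find(offset);
        if (u == used_.end()) return false;
        std::size_t size = u->second;
        used_.erase(u);

        // Merge with the free range that starts right after this one.
        auto next = free_.lower_bound(offset);
        if (next != free_.end() && offset + size == next->first) {
            size += next->second;
            next = free_.erase(next);
        }
        // Merge with the free range that ends right before this one.
        if (next != free_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == offset) {
                prev->second += size;
                return true;
            }
        }
        free_[offset] = size;
        return true;
    }

    // Total number of bytes currently available in the pool.
    std::size_t free_bytes() const {
        std::size_t total = 0;
        for (const auto& range : free_) total += range.second;
        return total;
    }

private:
    std::size_t round_up(std::size_t n) const {
        return (n + alignment_ - 1) / alignment_ * alignment_;
    }

    std::size_t alignment_;
    std::map<std::size_t, std::size_t> free_;  // start offset -> size
    std::map<std::size_t, std::size_t> used_;  // start offset -> size
};
```

With a pool like this, the expensive driver call happens once, and every later allocate/release is a small map operation on the host. A production allocator would also need locking, and probably size classes to limit fragmentation.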

If that’s the case, I must bite the bullet and implement a memory allocator after all. But maybe it’s just an issue with the Linux kernel or the exact version of the CUDA toolkit I have installed. Can anyone reproduce the large difference in speed?