I’m observing a significant (10x) difference in cuMemAlloc/cuMemFree speed depending on whether cuBLAS has been initialized, or on a small change in the size argument.
I asked about this on Stack Overflow a couple of days ago:
It seems quite possible that what you are observing is simply an artifact of the interaction between layered memory allocators. Memory allocations are typically handled by layers of allocators and sub-allocators: most allocations can be satisfied by the fastest, top-level allocator, but occasionally a new allocation requires falling back to the next-lower allocator. When this happens depends on the malloc/free pattern as well as on the size of the individual allocations.
This is much the same situation one finds in host code, where the C runtime library's malloc() satisfies most requests from its local storage pool until that pool runs low and it has to go to the slower OS allocator for a fresh chunk of memory.
A 10x performance difference between the fast top-level allocator and the allocator one level lower seems very much within expectations.
Applications requiring predictable timing for all allocations often allocate one large memory pool at application startup and then manage that pool themselves.
If that’s the case, I’ll have to bite the bullet and implement a memory allocator after all. But maybe it’s just an issue with the Linux kernel or with the exact version of the CUDA toolkit I have installed. Can anyone reproduce the large difference in speed?
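For anyone who wants to check, here is a minimal sketch of how one could time it (illustrative only, not my actual benchmark; the sizes, iteration counts, and the file name are placeholders):

```cpp
// Hypothetical repro sketch: times cuMemAlloc/cuMemFree for two nearby sizes,
// optionally after creating a cuBLAS handle first.
// Build with: nvcc -o allocbench allocbench.cu -lcuda -lcublas
#include <cstdio>
#include <chrono>
#include <cuda.h>
#include <cublas_v2.h>

// Average time in microseconds for one cuMemAlloc + cuMemFree pair.
static double timeAllocFree(size_t bytes, int iters)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        CUdeviceptr p;
        cuMemAlloc(&p, bytes);   // error checking omitted for brevity
        cuMemFree(p);
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
}

int main(int argc, char**)
{
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&ctx, dev);   // share the primary context with cuBLAS
    cuCtxSetCurrent(ctx);

    cublasHandle_t handle = nullptr;
    if (argc > 1)                          // pass any argument to initialize cuBLAS first
        cublasCreate(&handle);

    timeAllocFree(1u << 20, 10);           // warm-up, excluded from the measurement

    const size_t sizes[] = { 1u << 20, (1u << 20) + 4096 };  // two nearby sizes, illustrative
    for (size_t bytes : sizes)
        printf("size %zu: %.1f us per cuMemAlloc/cuMemFree pair\n",
               bytes, timeAllocFree(bytes, 1000));

    if (handle) cublasDestroy(handle);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```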
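And if I do end up writing my own allocator, this is roughly the shape I have in mind: grab one big block from cuMemAlloc once and hand out pieces with plain pointer arithmetic. Just a sketch under the assumptions noted in the comments (the DevicePool name and the stack-style reset are my simplifications), not a finished allocator:

```cpp
// Minimal sketch of a user-managed pool: one cuMemAlloc up front, then O(1)
// bump-pointer sub-allocations. Assumptions (for illustration only): a context
// is already current, single-threaded use, 256-byte alignment, and frees happen
// in bulk via reset() rather than individually.
#include <cstddef>
#include <cassert>
#include <cuda.h>

class DevicePool {                          // hypothetical class name
public:
    explicit DevicePool(size_t bytes) : size_(bytes), offset_(0) {
        cuMemAlloc(&base_, bytes);          // the one potentially slow allocation
    }
    ~DevicePool() { cuMemFree(base_); }

    // Hand out a sub-range of the pool without touching the driver allocator.
    CUdeviceptr alloc(size_t bytes) {
        size_t aligned = (bytes + 255) & ~size_t(255);
        assert(offset_ + aligned <= size_ && "pool exhausted");
        CUdeviceptr p = base_ + offset_;
        offset_ += aligned;
        return p;
    }

    // Release everything at once (per-frame / per-batch usage pattern).
    void reset() { offset_ = 0; }

private:
    CUdeviceptr base_;
    size_t size_;
    size_t offset_;
};
```

A real implementation would of course need a free list or similar to handle individual, out-of-order frees, but even this restricted form keeps the driver allocator out of the hot path.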