It seems quite possible that what you are observing are simply artifacts of the interaction between layered memory allocators. Memory allocations are typically handled by layers of allocators and sub-allocators: most allocations can be satisfied by the fastest, top-level allocator, but occasionally a new allocation requires falling back to the next-lower allocator. When this fallback happens depends on the malloc/free pattern as well as the size of the individual allocations.
This is much the same situation one finds in host code, where the C runtime library's malloc() satisfies most requests from its local storage pool, until that pool runs low and it needs to go to the slower OS allocator to get a fresh chunk of memory.
A 10x performance difference between the fast top-level allocator and the allocator one level lower seems very much within expectations.
Applications requiring predictable timing for all allocations will often allocate a large memory pool once at application startup and then manage that pool themselves, so that no allocation on the hot path ever reaches a slower allocator.