cuMemAlloc/cuMemFree performance: does their implementation need to do anything on the device?

What I found is that device memory allocation works much slower than the usual allocation routines on the host.
This seems rather weird. To my mind, memory allocation shouldn't need to do anything on the device itself, and could be handled entirely by the driver.

I also found that memory allocation on the device becomes especially slow for large allocation sizes, proceeding at approximately 40 GB/s on my card.
It seems that the device memory allocation routines do something with the allocated memory (maybe clear it?).
The lag shows up in both cuMemAlloc() and cuMemFree(); the latter is only insignificantly quicker.

This slowdown pushes me toward a workaround: allocate the whole card memory in a single block, and then use this single block as a heap, sub-allocating from it with standard approaches that don't clear anything.

Any ideas in regard to this issue? Maybe there's some kind of setting that switches off this "clearing" behavior?

Hey NVIDIA gurus, please respond!
O tmurray, where art thou? :)

It’s not clearing anything at the moment. It’s definitely going to be much slower than your average malloc call because it has to hit the kernel (the kernel-mode resource manager has to be aware that you’re allocating memory to prevent potential explosions from multiple clients later on).

This does give me an idea for a future feature, though…

Thank you for the response!

If it only performs some serialization of kernel-mode calls (if I understood you correctly), how does that explain the ~40 GB/s asymptotic allocation speed?
I mean, I found that allocation time is proportional to the allocation size. That wouldn't be the case if there were only kernel-mode bookkeeping, would it?

For example, it takes about 0.03 seconds (!) to allocate 850 MB.
I suppose this is a rather long period of time.
What work is it doing all this time?