Device Memory Allocation

Is device memory allocation always slow?

A while back I was pondering why a CUDA implementation of a sequential app was slower than the sequential original for small problem sizes. I did a thorough performance analysis and found that in terms of actual computation the CUDA implementation was in fact much, much faster. The reason it took longer was all the housekeeping associated with device management. No surprises there; the programming guide emphasizes the cost of host-device memory transfer. But when I dug deeper I found that it was device memory allocation that was taking all the time. I don’t have the numbers in front of me now, but I think it was something in the region of 100 ms regardless of the size of the allocation, far more than the time of the actual transfer from host to that memory.

Is this normal?

IIRC the first CUDA call loads a lot of driver state and has significant overhead. Was the memory allocation your first call?

In general the memory management of the NVIDIA driver is pretty slow. Therefore, in applications which need to do a lot of allocating and deallocating of memory, I always allocate a large chunk of device / pinned memory and manage it myself using a very simple custom first-fit allocator.

(This is not to bash the driver, which has to take care of multiple users, multiple applications, etc., a lot of stuff which the private allocator doesn’t have to care about, not to mention the trip to kernel space, which is not needed in this case.)

It seems like this could be easily fixed in the CUDA runtime by simply getting a relatively large page from the driver (maybe 1MB or 1% of a device’s total memory) and then allocating out of that until it runs out. If they wanted to take it a bit further, they could create a memory pool local to every application that grows on allocations that exceed the current size and is only released when the program terminates. In the steady state of an application that should perform well even for many small allocations/deallocations…

Yes, basically… The first call that actually does anything with the device…

This is kinda what I was thinking. If memory allocations etc. take a long time, why not allocate a huge chunk and then take care of smaller allocations within that chunk… Tigga’s answer above is worthy of some investigation…