I don’t think you are doing anything wrong.
Allocating memory in parallel is not really manageable. It behaves like
a critical section: only one thread at a time can be inside that
function, which is why the time increases linearly.
Otherwise, how would the driver know which thread gets which address range in memory?
Correct me if I’m wrong, but I cannot imagine how this could be done in a truly parallel manner.
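For concreteness, here is a minimal sketch (my own, with arbitrary grid and heap sizes) of the per-thread in-kernel allocation pattern we are talking about, where every thread calls the device-side malloc and so contends for the allocator:

```
#include <cuda_runtime.h>

// Every thread requests its own buffer from the device heap. If the
// allocator is effectively a critical section, these calls serialize.
__global__ void perThreadAlloc(int nPerThread)
{
    int *buf = (int *)malloc(nPerThread * sizeof(int));
    if (buf != NULL) {
        for (int i = 0; i < nPerThread; ++i)
            buf[i] = threadIdx.x;  // touch the memory so the allocation is used
        free(buf);
    }
}

int main()
{
    // Device-side malloc draws from a fixed heap; 32 MB is an arbitrary
    // size chosen for this sketch (the default may be too small).
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);
    perThreadAlloc<<<64, 128>>>(256);
    cudaDeviceSynchronize();
    return 0;
}
```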
I think you should always call malloc only once and divide the allocated area between the threads. This will also give you the possibility of coalesced reads, which I believe multiple allocations from inside the kernel do not.
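Here is a minimal sketch of that approach (the kernel name, sizes, and the strided layout are my own choices, not from the original post): a single cudaMalloc on the host, with each thread indexing into its own slice, laid out so that consecutive threads touch consecutive words and accesses coalesce:

```
#include <cuda_runtime.h>

__global__ void useSharedPool(int *pool, int nPerThread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nThreads = gridDim.x * blockDim.x;

    // Strided layout: element i of thread tid lives at pool[i * nThreads + tid],
    // so at every step a warp accesses one contiguous chunk (coalesced).
    for (int i = 0; i < nPerThread; ++i)
        pool[i * nThreads + tid] = tid;
}

int main()
{
    const int blocks = 64, threads = 128, nPerThread = 256;
    const int nThreads = blocks * threads;

    // One allocation for everybody, done once on the host.
    int *pool;
    cudaMalloc((void **)&pool, (size_t)nThreads * nPerThread * sizeof(int));

    useSharedPool<<<blocks, threads>>>(pool, nPerThread);
    cudaDeviceSynchronize();

    cudaFree(pool);
    return 0;
}
```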
There is a paper, “XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines”, describing how to do parallel memory allocation efficiently on the GPU.
The author was an intern at NVIDIA when he did this work, so I assumed that the memory allocator in CUDA 3.2 used his algorithm. However, according to my experiments, this is not the case.
I think the current memory allocator in CUDA 3.2 is just a naive implementation. Maybe someday they will use a more efficient algorithm.