Kernel malloc() efficiency is really bad

I am testing the memory allocator in CUDA.

According to my experiments, the malloc() in kernel functions is very inefficient.

I malloc() 1024 8-byte blocks in each thread, and test it under different gridDim/blockDim configurations.

It seems that the time consumed grows almost linearly with the number of threads used.

I am using GTX480 on Ubuntu 10.10 x86_64.

Am I doing something wrong, or is it just designed this way?
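For reference, here is a minimal sketch of the kind of benchmark I mean (my reconstruction, not my exact test code). Each thread calls the device-side malloc() 1024 times for an 8-byte block and then frees the blocks. Note that on CUDA 3.2 the heap-size call was cudaThreadSetLimit(); the modern name is used below:

```cuda
#include <cstdio>

#define NUM_ALLOCS 1024

__global__ void mallocBench()
{
    void *ptrs[NUM_ALLOCS];

    // Device-side malloc(), available on Fermi (sm_20) and later.
    for (int i = 0; i < NUM_ALLOCS; ++i)
        ptrs[i] = malloc(8);

    for (int i = 0; i < NUM_ALLOCS; ++i)
        free(ptrs[i]);
}

int main()
{
    // Device-side malloc draws from a heap whose size must be set
    // before the kernel launches.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256 * 1024 * 1024);

    mallocBench<<<8, 256>>>();   // grid/block sizes as in the table below
    cudaDeviceSynchronize();
    return 0;
}
```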

Here are some results:


grid	block	time (seconds)
1	1	0.038703
1	2	0.0665588
1	4	0.128362
1	8	0.282228
1	16	0.702546
1	32	1.99408
2	32	2.41123
4	32	5.44741
8	32	10.7305
8	64	23.8434
8	128	44.0881
8	256	94.8158

I don’t think you are doing anything wrong.
Allocating memory in parallel is not easy to manage. malloc() behaves like
a critical section, where only one thread at a time can be inside it.
That’s why the time increases linearly.

Otherwise, how would the driver know which thread gets which address range in memory?

Correct me if I’m wrong, but I cannot imagine how this could be done in a truly parallel manner.



I think you should always call malloc() only once and divide the allocated area among the threads. This also gives you the possibility of coalesced reads, which per-thread allocation inside the kernel does not, I believe.
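A sketch of that suggestion (illustrative, with the same workload size as the benchmark above): allocate one pool with a single cudaMalloc() on the host and give each thread a fixed slice, so no locking is needed inside the kernel:

```cuda
#include <cstdio>

#define BLOCKS_PER_THREAD 1024   // same per-thread workload as the benchmark
#define BLOCK_BYTES 8

__global__ void useSlices(char *pool)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread's slice starts at a fixed, precomputed offset,
    // so no synchronization is needed to "allocate" it.
    char *mySlice = pool + (size_t)tid * BLOCKS_PER_THREAD * BLOCK_BYTES;

    for (int i = 0; i < BLOCKS_PER_THREAD; ++i)
        mySlice[i * BLOCK_BYTES] = 0;    // touch each 8-byte block
}

int main()
{
    const int grid = 8, block = 256;
    size_t bytes = (size_t)grid * block * BLOCKS_PER_THREAD * BLOCK_BYTES;

    char *pool;
    cudaMalloc(&pool, bytes);            // one allocation for all threads
    useSlices<<<grid, block>>>(pool);
    cudaDeviceSynchronize();
    cudaFree(pool);
    return 0;
}
```

One caveat: with contiguous per-thread slices as above, neighboring threads touch addresses far apart; to actually get coalesced accesses you would interleave the layout so that consecutive threads read consecutive words.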


Thanks for your reply, Tobi.

There is a paper, “XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines”, describing how to do parallel memory allocation efficiently on the GPU.

The author was an intern at NVIDIA when he did this work, so I assumed that the memory allocator in CUDA 3.2 used his algorithm. However, according to my experiments, this is not the case.

I think the current memory allocator in CUDA 3.2 is just a naive implementation. Maybe someday they will switch to a more efficient algorithm.
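To illustrate that allocation does not have to serialize (this is my own sketch, much simpler than XMalloc itself): the most basic lock-free scheme is a "bump" allocator, where each thread claims a disjoint range from a shared pool with a single atomicAdd instead of entering a critical section:

```cuda
#include <cstdio>

__device__ char *g_pool;                 // pre-allocated pool
__device__ unsigned long long g_offset;  // next free byte in the pool

__device__ void *bumpAlloc(size_t bytes)
{
    // atomicAdd returns the old offset, so every thread receives
    // a disjoint byte range; no lock, no critical section.
    unsigned long long off = atomicAdd(&g_offset, (unsigned long long)bytes);
    return g_pool + off;
}

__global__ void demo()
{
    // Every thread gets its own 8-byte block concurrently.
    void *p = bumpAlloc(8);
    *(int *)p = threadIdx.x;
}

int main()
{
    char *pool;
    unsigned long long zero = 0;

    cudaMalloc(&pool, 1 << 20);
    cudaMemcpyToSymbol(g_pool, &pool, sizeof(pool));
    cudaMemcpyToSymbol(g_offset, &zero, sizeof(zero));

    demo<<<8, 256>>>();
    cudaDeviceSynchronize();
    cudaFree(pool);
    return 0;
}
```

Of course a bump allocator never reuses freed space; a real allocator like XMalloc layers free lists and caching on top of this kind of atomic primitive to stay scalable.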