I’m wondering why cudaMalloc & friends are relatively slow (~1 ms). This wasn’t a problem in the past because I usually only allocated buffers once. But now I need to use multiple CUDA streams, and I would like to allocate the buffers dynamically each time because:
- having static buffers private to each thread might be wasteful
- it makes the code simpler by making it stateless: convolution should be stateless, but up to now I have had to keep a context containing temporary buffers as state (see the sketch after this list).
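To make the second point concrete, here is a rough sketch of the stateless version I have in mind (hypothetical names, placeholder kernel), where the scratch buffer is allocated and freed inside each call:

```cpp
// Hypothetical sketch (not the real convolution): a stateless entry point that
// allocates and frees its scratch buffer on every call instead of keeping it
// in a context object. The per-call cudaMalloc/cudaFree pair is exactly the
// ~0.5-2 ms cost measured below.
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real convolution work.
__global__ void convolve_kernel(const float *in, float *out, float *scratch, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        scratch[i] = in[i];   // real code would stage intermediate results here
        out[i]     = scratch[i];
    }
}

// Stateless wrapper: no context object, no persistent temporary buffers.
cudaError_t convolve(const float *d_in, float *d_out, int n, cudaStream_t stream)
{
    float *d_scratch = 0;
    cudaError_t err = cudaMalloc((void **)&d_scratch, n * sizeof(float));
    if (err != cudaSuccess) return err;

    convolve_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, d_scratch, n);

    // Note: cudaFree may synchronize with outstanding work on the device.
    return cudaFree(d_scratch);
}
```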
I tested 1000 iterations of cudaMalloc followed by cudaFree and got these times:
T(malloc + free) ≈ 5 * 10^-4 s per pair for 1 MiB
T(malloc + free) ≈ 2 * 10^-3 s per pair for 32 MiB

which means it’s too expensive to do dynamic allocation on every call. For comparison, a malloc/free pair in main memory takes about 3 * 10^-6 s.
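For anyone who wants to reproduce this, the test was roughly of this shape (a minimal sketch, not the exact harness: it uses std::chrono for host-side wall-clock timing and creates the context up front so that cost isn’t included in the measurement):

```cpp
// Rough benchmark sketch: time N cudaMalloc/cudaFree pairs of a fixed size.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    const int    iters = 1000;
    const size_t bytes = size_t(1) << 20;   // 1 MiB; also try 32 MiB

    cudaFree(0);                            // force CUDA context creation before timing

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        void *p = 0;
        cudaMalloc(&p, bytes);
        cudaFree(p);
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%g s per cudaMalloc/cudaFree pair of %zu bytes\n",
                secs / iters, bytes);
    return 0;
}
```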
Does cudaMalloc have to read free lists or whatever from the GPU across the PCIe bus? If so, I suppose the time will improve in the future when the CPU also moves onto the GPU, as I saw in Bill Dally’s Stanford lecture describing a GPU in 2017:
- 2500 throughput processors/arithmetic units
- 16 low-latency processors
- 128 GiB of memory