cudaMalloc, cudaFree speed

I’m wondering why cudaMalloc and friends are relatively slow (~1 ms). This wasn’t a problem in the past because I usually allocated buffers only once. But now I need to use multiple CUDA streams and would like to allocate the buffers dynamically on each call because:

  1. static buffers private to each thread might be wasteful

  2. it would make the code simpler by making it stateless - convolution should be stateless, but up to now I have had to keep a context containing temporary buffers as state.
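To illustrate what "stateless" would look like, here is a minimal host-side sketch. The names `ScratchDeleter`, `make_scratch` and `convolve_valid` are made up for illustration, and `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree` so that it runs without a GPU. The temporary buffer lives only for the duration of one call, so no context object is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>
#include <vector>

// Deleter for the scratch buffer. Assumption: in a CUDA build this
// would call cudaFree; std::free stands in on the host.
struct ScratchDeleter {
    void operator()(float* p) const { std::free(p); }
};
using ScratchBuffer = std::unique_ptr<float[], ScratchDeleter>;

// Allocate a temporary buffer that is released automatically when it
// goes out of scope (hypothetical helper).
ScratchBuffer make_scratch(std::size_t count) {
    float* p = static_cast<float*>(std::malloc(count * sizeof(float)));
    if (p == nullptr) throw std::bad_alloc();
    return ScratchBuffer(p);
}

// Stateless 1-D convolution ("valid" region only): everything it needs
// is passed in or allocated per call.
std::vector<float> convolve_valid(const std::vector<float>& signal,
                                  const std::vector<float>& kernel) {
    if (kernel.empty() || signal.size() < kernel.size()) return {};

    // Per-call temporary: the flipped kernel.
    ScratchBuffer flipped = make_scratch(kernel.size());
    for (std::size_t j = 0; j < kernel.size(); ++j) {
        flipped[j] = kernel[kernel.size() - 1 - j];
    }

    std::vector<float> out(signal.size() - kernel.size() + 1, 0.0f);
    for (std::size_t i = 0; i < out.size(); ++i) {
        for (std::size_t j = 0; j < kernel.size(); ++j) {
            out[i] += signal[i + j] * flipped[j];
        }
    }
    return out;  // scratch buffer is freed here
}
```

The price of this style is one allocate/free pair per call, which is exactly the cost in question.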

I tested 1000 iterations of cudaMalloc followed by cudaFree and got these times:

T(malloc + free) = 5 * 10^-4 s for 1 MiB
T(malloc + free) = 2 * 10^-3 s for 32 MiB

which means dynamic allocation on every call is too expensive. For comparison, an allocate/free pair in main memory takes 3 * 10^-6 s.
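For anyone who wants to repeat the measurement, a rough sketch of such a timing loop follows. The helper names `seconds_per_pair` and `host_seconds_per_pair` are made up for illustration. The allocator is passed in as a pair of callables, so the host version below uses `std::malloc`/`std::free`; for the GPU case one would presumably pass thin wrappers around `cudaMalloc` and `cudaFree` instead (an assumption, not shown here so the code runs without CUDA).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Average wall-clock seconds for one allocate/free pair.
// alloc:   callable taking a byte count and returning void*
// release: callable taking the pointer returned by alloc
template <typename Alloc, typename Free>
double seconds_per_pair(Alloc alloc, Free release,
                        std::size_t bytes, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        void* p = alloc(bytes);
        release(p);
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count() / iterations;
}

// Host baseline using std::malloc/std::free.
inline double host_seconds_per_pair(std::size_t bytes, int iterations) {
    return seconds_per_pair(
        [](std::size_t n) -> void* {
            void* p = std::malloc(n);
            // Volatile write so the compiler cannot drop the allocation.
            if (p != nullptr) static_cast<volatile char*>(p)[0] = 0;
            return p;
        },
        [](void* p) { std::free(p); },
        bytes, iterations);
}
```

Calling `host_seconds_per_pair(1 << 20, 1000)` gives the host figure for 1 MiB blocks. The result depends heavily on the C library, block size and machine, so treat any single number as a rough indication only.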

Does cudaMalloc have to read free lists or whatever from the GPU across the PCIe bus? If so, I suppose the time will improve in the future when the CPU also moves onto the GPU, as I saw in Bill Dally’s Stanford lecture describing a GPU in 2017:

2500 throughput processors/arithmetic units
16 low latency processors
128 GiB memory

Hello,

My understanding of cudaMalloc (and the like) is that you almost have to stop the card to update its virtual memory mapping. Perhaps I’m wrong, but that would explain why cudaMalloc is so costly and why its overhead is not constant: if it were merely expensive, the driver would surely keep an allocation cache. Does anyone have a better idea?

Cédric
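If the driver does not cache allocations, an application can do so itself. Below is a minimal sketch of such a cache, keyed by exact block size. `CachingAllocator` and its members are hypothetical names, and `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree` so it runs on the host. It is not thread-safe as written, so with several threads it would need a mutex or one pool per thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Keeps released blocks and hands them out again for requests of the
// same size, so the expensive backend is only hit on a cache miss.
class CachingAllocator {
public:
    using AllocFn = std::function<void*(std::size_t)>;
    using FreeFn = std::function<void(void*)>;

    CachingAllocator(AllocFn alloc, FreeFn release)
        : alloc_(std::move(alloc)), release_(std::move(release)) {}
    CachingAllocator(const CachingAllocator&) = delete;
    CachingAllocator& operator=(const CachingAllocator&) = delete;

    // Returns cached blocks to the backend. Blocks that were acquired
    // but never recycled are not freed here.
    ~CachingAllocator() {
        for (auto& bucket : free_) {
            for (void* p : bucket.second) release_(p);
        }
    }

    // Reuse a cached block of exactly this size, else ask the backend.
    void* acquire(std::size_t bytes) {
        void* p = nullptr;
        auto it = free_.find(bytes);
        if (it != free_.end() && !it->second.empty()) {
            p = it->second.back();
            it->second.pop_back();
        } else {
            p = alloc_(bytes);
            if (p == nullptr) return nullptr;
            ++backend_calls_;
        }
        live_[p] = bytes;
        return p;
    }

    // Put a block back into the cache instead of freeing it.
    void recycle(void* p) {
        auto it = live_.find(p);
        if (it == live_.end()) return;  // not handed out by this pool
        free_[it->second].push_back(p);
        live_.erase(it);
    }

    // Number of times the backend allocator was actually called.
    std::size_t backend_calls() const { return backend_calls_; }

private:
    AllocFn alloc_;
    FreeFn release_;
    std::map<std::size_t, std::vector<void*>> free_;  // size -> idle blocks
    std::map<void*, std::size_t> live_;               // block -> size
    std::size_t backend_calls_ = 0;
};
```

With GPU memory and multiple streams there is an extra caveat: a block should only be recycled once all work that uses it has finished, otherwise another stream could receive it while it is still in use.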

I am currently facing the same problem. However, I notice different (unexpected) behaviour on different cards:

As a test, I sum up the time it takes for all calls to cudaMallocArray and cudaFree together.

This test was run on 3 machines, with the following results:

  • On a GTX 460 it takes 770 ms (driver 304.32)
  • On a GTX 690 it takes 1468 ms (driver 304.51)
  • On a GTX 480 it takes 988 ms (driver 304.32, slightly older CPU)

Interestingly, if I compare the execution times of the kernels, all cards behave as expected: the GTX 460 is the slowest and the GTX 690 the fastest.

Can anyone explain this behaviour?

Thanks a lot for your kind replies in advance!

Olli