I’m wondering why cudaMalloc & friends are relatively slow (~1 ms). This wasn’t a problem in the past because I usually only allocated buffers once. But now I need to use multiple CUDA streams, and I would like to allocate the buffers dynamically each time because:
- having static buffers private to each thread might be wasteful
- it makes the code simpler by making it stateless: convolution should be stateless, but up to now I have had to keep a context containing temporary buffers as state (see the sketch after this list).
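To make the second point concrete, here is a rough sketch of the stateless version I have in mind (hypothetical names, placeholder kernel), where the scratch buffer is allocated and freed inside each call:

```cpp
// Hypothetical sketch (not the real convolution): a stateless entry point that
// allocates and frees its scratch buffer on every call instead of keeping it
// in a context object. The per-call cudaMalloc/cudaFree pair is exactly the
// ~0.5-2 ms cost measured below.
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real convolution work.
__global__ void convolve_kernel(const float *in, float *out, float *scratch, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        scratch[i] = in[i];   // real code would stage intermediate results here
        out[i]     = scratch[i];
    }
}

// Stateless wrapper: no context object, no persistent temporary buffers.
cudaError_t convolve(const float *d_in, float *d_out, int n, cudaStream_t stream)
{
    float *d_scratch = 0;
    cudaError_t err = cudaMalloc((void **)&d_scratch, n * sizeof(float));
    if (err != cudaSuccess) return err;

    convolve_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_in, d_out, d_scratch, n);

    // Note: cudaFree may synchronize with outstanding work on the device.
    return cudaFree(d_scratch);
}
```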
I tested 1000 iterations of cudaMalloc followed by cudaFree and got these times:
T(malloc + free) ≈ 5 * 10^-4 s per pair for 1 MiB
T(malloc + free) ≈ 2 * 10^-3 s per pair for 32 MiB

which means it’s too expensive to do dynamic allocation on every call. For comparison, a malloc/free pair in main memory takes about 3 * 10^-6 s.
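For anyone who wants to reproduce this, the test was roughly of this shape (a minimal sketch, not the exact harness: it uses std::chrono for host-side wall-clock timing and creates the context up front so that cost isn’t included in the measurement):

```cpp
// Rough benchmark sketch: time N cudaMalloc/cudaFree pairs of a fixed size.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main()
{
    const int    iters = 1000;
    const size_t bytes = size_t(1) << 20;   // 1 MiB; also try 32 MiB

    cudaFree(0);                            // force CUDA context creation before timing

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        void *p = 0;
        cudaMalloc(&p, bytes);
        cudaFree(p);
    }
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%g s per cudaMalloc/cudaFree pair of %zu bytes\n",
                secs / iters, bytes);
    return 0;
}
```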
Does cudaMalloc have to read free lists or whatever from the GPU across the PCIe bus? If so, I suppose the time will improve in the future when the CPU also moves onto the GPU, as I saw in Bill Dally’s Stanford lecture describing a GPU in 2017:
- 2500 throughput processors/arithmetic units
- 16 low-latency processors
- 128 GiB of memory