Allocating GPU buffers takes a lot of time

Hi everyone,
I wrote a CUDA program that computes a 256*256 matrix product, using 256 blocks and 256 threads. However, allocating the GPU buffers (three matrices: two for input, one for output) takes a lot of time (83 ms). Is this normal? How can I reduce it?

I use

cudaMalloc((void**)&dev_c, size * sizeof(float));
cudaMalloc((void**)&dev_a, size * sizeof(float));
cudaMalloc((void**)&dev_b, size * sizeof(float));

to allocate the GPU buffers.

What you are probably experiencing is CUDA start-up time, if these operations are the first CUDA operations in your code. If you time more precisely, you will likely find that the first allocation accounts for most of the time and the remaining allocations for much less.

Any CUDA program will experience some start-up overhead as the CUDA runtime initializes. This overhead can be influenced by many factors, but from a timing perspective the key is to isolate the start-up time from the rest of your measurements.

One thing you can try is to put

cudaFree(0);

at the start of your code. This doesn't change the start-up time, but from a timing perspective it will all get "absorbed" into that cudaFree operation/line of code, so your other measurements will be less affected by it.
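Putting the pieces together, the warm-up trick and the timing isolation might look like the sketch below. This is a minimal, illustrative example, not the poster's actual program; the matrix size and variable names are taken from the question, and the CUDA-event timing is one common way to measure the allocations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Warm-up: force CUDA context creation here, so the start-up
    // overhead is absorbed by this call instead of the first cudaMalloc.
    cudaFree(0);

    const int size = 256 * 256;  // 256x256 matrix, as in the question
    float *dev_a, *dev_b, *dev_c;

    // Time only the allocations, using CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMalloc((void**)&dev_c, size * sizeof(float));
    cudaMalloc((void**)&dev_a, size * sizeof(float));
    cudaMalloc((void**)&dev_b, size * sizeof(float));
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMalloc time: %.3f ms\n", ms);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

With the warm-up call in place, the measured allocation time should drop to a small fraction of the original 83 ms, since only the allocations themselves, not runtime initialization, fall inside the timed region.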

Thank you for your suggestion; I will test it.

Yes, you are right. When I add this, the time to allocate the GPU buffers becomes 0 ms.

Thanks for your help