cudaMalloc execution time

When allocating GPU memory with cudaMalloc, the first call takes approximately 200 ms.
This happens only on the first call; from the second call onwards, allocating the same amount of memory in the same way takes only about 10 ms.

cudaMalloc((void**)&a, (width * height * sizeof(int)));

Why is there such a large difference? Is there any way to shorten the time required for the initial memory allocation?

If the first cudaMalloc() is the first CUDA API call overall in your application, it will also trigger initialization of a CUDA context, which can be a fairly costly operation (much of this time is spent mapping all GPU and system memory into a single unified virtual address map).

The classical trick to trigger CUDA context initialization at a point that is more convenient is to issue a cudaFree(0). No idea whether this still works, but worth a try.
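A minimal sketch of the trick, for illustration only (requires a CUDA-capable GPU; the buffer size and timing output are arbitrary examples):

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    using clk = std::chrono::steady_clock;
    using ms  = std::chrono::milliseconds;

    // Classical warm-up: a no-op cudaFree(0) forces CUDA context
    // initialization at a point of our choosing.
    auto t0 = clk::now();
    cudaFree(0);
    auto t1 = clk::now();

    // Subsequent allocations no longer pay the context-creation cost.
    int *a = nullptr;
    size_t bytes = 1024u * 1024u * sizeof(int);  // example size
    auto t2 = clk::now();
    cudaMalloc((void**)&a, bytes);
    auto t3 = clk::now();

    printf("context init: %lld ms, cudaMalloc: %lld ms\n",
           (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
           (long long)std::chrono::duration_cast<ms>(t3 - t2).count());

    cudaFree(a);
    return 0;
}
```

With this structure, the expensive first call can be issued during application startup (e.g. while other initialization work is happening) rather than on the critical path.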

CUDA initialization time can also be influenced by module loading time. To minimize the upfront time expenditure for this at initialization time and defer it to the point of use, you would want to set the environment variable CUDA_MODULE_LOADING=LAZY (this may already be the default depending on platform).
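Setting the variable is just an environment change before launching the application (the application name here is a placeholder):

```shell
# Defer CUDA module loading from initialization time to first use.
# On recent CUDA versions/platforms this may already be the default.
export CUDA_MODULE_LOADING=LAZY
./my_cuda_app
```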

Generally speaking, the time for CUDA context initialization and calls to cudaMalloc() correlates strongly with the single-thread performance of the host system's CPU (with system memory performance a weak secondary factor). High single-thread performance in CPUs in turn correlates strongly with CPU clock frequency. For this reason I recommend using CPUs with a base frequency >= 3.5 GHz. Nowadays, CPUs with up to 48 physical cores that satisfy this criterion are available.

It should still work. However, since CUDA 12 the documented way to trigger context initialization is calling cudaSetDevice.
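A sketch of the CUDA 12 approach, with the error check that a real application would want (device 0 is an assumption; pick whichever device you will actually use):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Since CUDA 12, cudaSetDevice() is documented to eagerly initialize
    // the context for the chosen device, so calling it early pays the
    // initialization cost at a convenient point.
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Later cudaMalloc() calls no longer include context creation time.
    int *a = nullptr;
    cudaMalloc((void**)&a, 1024 * sizeof(int));  // example allocation
    cudaFree(a);
    return 0;
}
```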
