Very slow kernel launch after a number of kernels have been launched

Hi everyone,

I have run into a very strange problem with my program, which contains many different kernels. I found that after a number of different kernels have been launched, CUDA does something very time-consuming (about 0.5 s on my machine) before launching the next kernel, even if that kernel has been launched before. I know there is a warm-up cost for CUDA, but that only applies before the first kernel launch, doesn't it? Has anyone run into this before, or is there a workaround? Thanks!

Regards,
Jun

New discovery: I found that this time is spent in cudaMalloc. I have plenty of free memory on the GPU (more than 800 MB), and the cudaMalloc call only allocates 450 x 4 KB (= 1.8 MB) for an array. I don't know why this happens. Is there some trick to using cudaMalloc? Please note that this is not the first cudaMalloc in my program. Thanks!

I would be looking at your timing: CUDA kernel launches are asynchronous, so the time you are attributing to cudaMalloc may well belong to the kernel launched before it.

The correct way to time with host-side timers looks like this, in pseudocode:

timerStart()
myKernel<<<grid, block>>>(args)
cudaThreadSynchronize()
timerStop()

If you don't synchronize, you will only measure the launch time, not the run time, and any blocking call (cudaMalloc, cudaMemcpy, etc.) will block until the kernel finishes.
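For completeness, here is a minimal, self-contained sketch of that pattern; the kernel, its launch configuration, and the array size are invented purely for illustration, and cudaDeviceSynchronize is simply the current name for cudaThreadSynchronize:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel, just so there is something to time.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    auto t0 = std::chrono::high_resolution_clock::now();

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();   // wait for the kernel to finish; without this
                               // the timer only captures the launch overhead

    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(d_data);
    return 0;
}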

Thanks a lot. I found that it was because the kernel launched right before it actually runs very slowly on the GPU. I didn't know that a kernel launch returns immediately; I thought it would return only after the GPU had finished the calculation. Thank you again!

Regards,

Jun