Very slow kernel launch after a number of kernels have been launched

Hi everyone,

I have run into a very strange problem with my program, which contains many different kernels. I found that after a number of different kernels have been launched, CUDA does something very time-consuming (about 0.5 s on my machine) before launching the next kernel, even if that kernel has been launched before. I know there is a warm-up cost for CUDA, but that only applies before the first kernel launch, doesn't it? Has anyone run into this before, or is there a workaround? Thanks!

Regards,
Jun

New discovery: I found that this time is spent in cudaMalloc. I have plenty of free memory on the GPU (more than 800 MB), and the cudaMalloc call only allocates 450 x 4 KB (= 1.8 MB) for an array. I don't know why this happens. Is there some trick to using cudaMalloc? Please note that this is not the first cudaMalloc in my program. Thanks!

I would be looking at your timing: CUDA kernel launches are asynchronous, so the time you are attributing to cudaMalloc may well belong to the kernel launched before it.

The correct way to time with host-side timers looks like this, in pseudocode:

timerStart()
myKernel<<<grid, block>>>(args)
cudaThreadSynchronize()
timerStop()

If you don't synchronize, you will only measure the launch time, not the run time, and any blocking call (cudaMalloc, cudaMemcpy, etc.) will block until the kernel finishes.
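For completeness, here is a minimal, self-contained sketch of that pattern; the kernel, its launch configuration, and the array size are invented purely for illustration, and cudaDeviceSynchronize is simply the current name for cudaThreadSynchronize:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel, just so there is something to time.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    auto t0 = std::chrono::high_resolution_clock::now();

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();   // wait for the kernel to finish; without this
                               // the timer only captures the launch overhead

    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("kernel time: %.3f ms\n", ms);

    cudaFree(d_data);
    return 0;
}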

Thanks a lot. I found that it was because the kernel launched right before it actually runs very slowly on the GPU. I didn't know that a kernel launch returns immediately; I thought it would return only after the GPU had finished the calculation. Thank you again!

Regards,

Jun