I’ve run into a very strange problem in my own program, which contains many different kernels. After a number of different kernels have been launched, CUDA does something very time-consuming (about 0.5 s on my machine) before launching the next kernel, even if that kernel has been launched before. I know CUDA has a warm-up time, but that only applies before the first kernel is launched, doesn’t it? Has anyone run into this before, or is there a workaround? Thanks!
New discovery: the time is actually spent in cudaMalloc. I have plenty of free memory on the GPU (more than 800 MB), and this cudaMalloc only allocates 450x4 KB (= 1.8 MB) for an array. I don’t know why this happens. Is there some trick to using cudaMalloc? Please note that this is not the first cudaMalloc in my program. Thanks!
Thanks a lot. I found that the cause is the kernel right before it, which actually runs very slowly on the GPU. I didn’t know that a kernel launch returns immediately; I thought it would return only after the GPU had finished the calculation. Thank you again!
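For anyone who hits the same thing, here is a minimal sketch of how the two costs can be separated. It is not my real code: `slow_kernel`, `measure`, and the sizes are made up for illustration. The host timer around the launch only measures how long the launch call takes to return, while the CUDA events measure how long the GPU actually spends on the kernel. A later call that has to wait for the GPU (in my case it was cudaMalloc) can end up absorbing that second number.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical, deliberately slow kernel: every thread spins in a loop.
__global__ void slow_kernel(float *out, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (float)i;
    for (int k = 0; k < iters; ++k)
        x = x * 1.0000001f + 0.5f;
    out[i] = x;  // store the result so the loop is not optimized away
}

// Measures the host-side time of the launch call (launch_ms) and the
// GPU-side time between the two events around the kernel (gpu_ms).
cudaError_t measure(int iters, double *launch_ms, float *gpu_ms)
{
    const int threads = 256;
    const int blocks = 4096;
    float *d_out = nullptr;
    cudaEvent_t start, stop;

    cudaError_t err = cudaMalloc((void **)&d_out,
                                 (size_t)blocks * threads * sizeof(float));
    if (err != cudaSuccess) return err;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up launch so one-time setup cost is not counted below.
    slow_kernel<<<blocks, threads>>>(d_out, 1);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    auto t0 = std::chrono::steady_clock::now();
    slow_kernel<<<blocks, threads>>>(d_out, iters);  // returns without waiting
    auto t1 = std::chrono::steady_clock::now();
    cudaEventRecord(stop);

    err = cudaGetLastError();                 // launch configuration errors
    if (err == cudaSuccess)
        err = cudaEventSynchronize(stop);     // block until the kernel is done
    if (err == cudaSuccess)
        err = cudaEventElapsedTime(gpu_ms, start, stop);

    *launch_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return err;
}

int main()
{
    double launch_ms = 0.0;
    float gpu_ms = 0.0f;
    cudaError_t err = measure(20000, &launch_ms, &gpu_ms);
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("launch call returned after %.3f ms (host timer)\n", launch_ms);
    std::printf("kernel ran for %.3f ms (CUDA events)\n", gpu_ms);
    return 0;
}
```

If the second number is much larger than the first, the slow part is the kernel itself and not whatever call comes after it. While profiling, putting a cudaDeviceSynchronize() after each launch makes the host timer charge the time to the right kernel.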