Calling a kernel in a loop spends much time in cudaFree

Hello, I have a problem with calling a kernel in a loop.

I call a kernel inside a for-loop.
When the number of iterations is small, there is no problem.
But when I increase the number of iterations, the computation time goes up, particularly in cudaFree.

So my question is the following:

  1. Is calling a kernel in a loop related to spending so much time in cudaFree or cudaDeviceSynchronize?

GPU: GTX 1060
memory size: 180000 * sizeof(float)
threads per block: 1024
blocks: about 177

Thank you for any answers.

cudaDeviceSynchronize() waits for the GPU to finish before allowing the CPU thread to continue. This may involve a polling busy-wait loop (100% utilization of one CPU core) to achieve the lowest possible latency.

cudaFree() probably implicitly synchronizes the CUDA context as well, because altering the memory heap on the device while it’s still computing would be unacceptably risky.
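To make that concrete, here is a minimal, hypothetical sketch (the kernel, names and sizes are placeholders, not taken from your code) showing how an asynchronous kernel launch returns immediately, so its runtime gets charged to whatever synchronizing call comes next (cudaDeviceSynchronize here, or cudaFree if that is the next such call):

```cpp
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Placeholder kernel: just enough busy work to take measurable time.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int k = 0; k < 1000; ++k)
            data[i] = data[i] * 1.0001f + 0.5f;
}

int main()
{
    const int n = 180000;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // The launch returns almost immediately; the GPU keeps working in the background.
    auto t0 = std::chrono::steady_clock::now();
    dummyKernel<<<(n + 1023) / 1024, 1024>>>(d_data, n);
    auto t1 = std::chrono::steady_clock::now();

    // This call blocks until the kernel has finished, so the kernel's runtime
    // shows up here rather than in the launch above.
    cudaDeviceSynchronize();
    auto t2 = std::chrono::steady_clock::now();

    printf("launch: %lld us, sync: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count());

    cudaFree(d_data);
    return 0;
}
```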

It’s likely that the time spent in these API calls is just waiting for the GPU to finish. When you say that there is no problem at low iteration counts, it may also be that you are hitting some kind of limit on the kernel launch queue that causes blocking at larger iteration counts. It is impossible to tell without knowing the details of your kernel (i.e. how long it computes for one iteration).
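If you want to find out how long one iteration actually computes on the device, CUDA events report GPU-side time directly, independent of where the host happens to block. This is only a sketch under assumed names (myKernel, the array size and launch configuration are stand-ins for your real code):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real computation.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 180000;
    const int threads = 1024;
    const int blocks = (n + threads - 1) / threads;

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < 1000; ++iter) {
        cudaEventRecord(start);
        myKernel<<<blocks, threads>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);       // wait only for this kernel

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (iter % 100 == 0)
            printf("iteration %d: kernel took %.3f ms\n", iter, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```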

In general it is good advice to keep heap allocations out of tight compute loops. It’s better to allocate enough memory for your use case once and reuse it in the inner loops over and over. In performance-critical cases you might have to allocate several buffers in page-locked host memory, making use of CUDA streams to overlap memory transfers and compute.
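A minimal sketch of that allocate-once pattern is below; again the kernel, buffer size and iteration count are placeholders, not your actual code:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; stands in for the real computation.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 180000;
    const int threads = 1024;
    const int blocks = (n + threads - 1) / threads;
    const int iterations = 10000;

    // Allocate once, outside the loop. Pinned (page-locked) host memory also
    // makes cudaMemcpyAsync truly asynchronous if you later add streams.
    float *h_data, *d_data;
    cudaMallocHost(&h_data, n * sizeof(float));
    cudaMalloc(&d_data, n * sizeof(float));

    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // The loop only launches kernels; no cudaMalloc/cudaFree inside it.
    for (int iter = 0; iter < iterations; ++iter)
        myKernel<<<blocks, threads>>>(d_data, n);

    // The blocking copy synchronizes once, after the whole loop.
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);

    // Free once, after the loop.
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

If you later need to overlap transfers with compute, the pinned buffer above can be split into chunks and copied with cudaMemcpyAsync on separate cudaStream_t streams while other chunks are being processed.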