Delay between multiple kernel calls

I am running an iterative flow using code something like this:

Profiling shows a gap of roughly 3 ms between kernel calls. The data size is 4M doubles, and I am running on a P100.

Can someone explain how to avoid this delay? I'm also open to other implementations. I don't want to start the next loop iteration until all blocks have been processed, and I assume I need to do that by exiting the kernel and calling cudaDeviceSynchronize(), as I'm doing now. Thanks for any guidance.

I assume you are using managed memory for storage of outer_loop_global_there_were_changes. (If not, please explain what it is and disregard the following.) Because you are on a Pascal or newer GPU, and I assume you are not on Windows, this line of code:

if (*outer_loop_global_there_were_changes == false) {

will almost certainly trigger a CPU page fault. The servicing of that page fault could be “costly”.

For a single bool quantity like that, I would switch its storage from managed memory to pinned/zero-copy memory. If your kernel code is constantly hammering on that location (e.g. setting it to true), the switch from managed to pinned could have a negative perf impact (although you are likely hitting a page fault in the other direction as well, based on what I see here). In that case I would seek to minimize the number of times you write to that location from the kernel, for example by keeping a local copy that you update as often as needed, and then updating the global location from the local copy once per threadblock, when that threadblock retires.
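To make the pinned/zero-copy suggestion concrete, here is a minimal sketch. The kernel `halve_step` and the host helper `run_until_stable` are hypothetical stand-ins (the original kernel is not shown in this thread); only the handling of the flag is the point. The flag is allocated with `cudaHostAlloc` and the `cudaHostAllocMapped` flag, so the host reads it as ordinary host memory after `cudaDeviceSynchronize()` and no managed-memory page migration is involved.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-iteration work:
// halve every value above 1.0 and report that something changed.
__global__ void halve_step(double *data, size_t n, bool *changed)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n && data[i] > 1.0) {
        data[i] *= 0.5;
        *changed = true;  // zero-copy write that lands directly in pinned host memory
    }
}

// Launches halve_step until an iteration reports no changes.
// Returns the number of kernel launches, or -1 on a CUDA error.
int run_until_stable(double *d_data, size_t n)
{
    assert(d_data != nullptr && n > 0);

    bool *h_changed = nullptr;  // host view of the flag (pinned, mapped)
    bool *d_changed = nullptr;  // device view of the same allocation
    if (cudaHostAlloc((void **)&h_changed, sizeof(bool), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    if (cudaHostGetDevicePointer((void **)&d_changed, h_changed, 0) != cudaSuccess) {
        cudaFreeHost(h_changed);
        return -1;
    }

    const unsigned int threads = 256;
    const unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
    int launches = 0;
    do {
        *h_changed = false;  // plain host store, no page fault
        halve_step<<<blocks, threads>>>(d_data, n, d_changed);
        ++launches;
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess) {
            std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
            launches = -1;
            break;
        }
    } while (*h_changed);  // plain host load after the sync

    cudaFreeHost(h_changed);
    return launches;
}
```

In a real application the flag would normally be allocated once up front and reused across all iterations, since the allocation itself has a cost.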

Your assumptions were correct. Switching to pinned/zero-copy memory removed the 3 ms delay per kernel call. Unfortunately, the cudaHostAlloc call took some additional time. It is still an overall win, especially as the number of iterations increases.

I examined the number of writes and it was very large. Essentially every pixel was updating that variable. I added another level of logic where threads write to a shared variable and then one thread writes to the pinned memory just once per block. That did not have any noticeable effect on the runtime.
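For anyone finding this later, the per-block aggregation looks roughly like the sketch below. It is an illustration under assumptions, not my actual code: `halve_step_block_flag` and `step_once` are made-up names, and the "halve values above 1.0" work is a placeholder for the real per-pixel update. Threads set a flag in shared memory, and only thread 0 of each block touches the pinned flag.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder work: halve every value above 1.0.
// Changes are collected in shared memory, then reported once per block.
__global__ void halve_step_block_flag(double *data, size_t n, bool *global_changed)
{
    __shared__ bool block_changed;
    if (threadIdx.x == 0) block_changed = false;
    __syncthreads();  // make the initial value visible to the whole block

    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n && data[i] > 1.0) {
        data[i] *= 0.5;
        block_changed = true;  // all writers store the same value
    }
    __syncthreads();  // wait until every thread in the block has had its say

    // At most one write per block to the pinned/zero-copy flag.
    if (threadIdx.x == 0 && block_changed) *global_changed = true;
}

// Runs one iteration and returns whether anything changed.
// h_flag and d_flag are the host and device views of one mapped pinned bool
// (from cudaHostAlloc with cudaHostAllocMapped and cudaHostGetDevicePointer).
// Returns false on a CUDA error as well, after printing it.
bool step_once(double *d_data, size_t n, bool *h_flag, bool *d_flag)
{
    const unsigned int threads = 256;
    const unsigned int blocks = (unsigned int)((n + threads - 1) / threads);

    *h_flag = false;
    halve_step_block_flag<<<blocks, threads>>>(d_data, n, d_flag);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return false;
    }
    return *h_flag;
}
```

The built-in `__syncthreads_or()` could replace the shared variable and the second barrier, but the version above matches what I described.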

Thank you for your quick and correct solution, Robert.