I assume you are using managed memory to store outer_loop_global_there_were_changes. (If not, please explain what it is and disregard the following.) Because you are on a Pascal or newer GPU, and I assume you are not on Windows, this line of code:
if (*outer_loop_global_there_were_changes == false) {
will almost certainly trigger a CPU page fault. The servicing of that page fault could be “costly”.
For a single bool quantity like that, I would switch its storage from managed memory to pinned/zero-copy memory. If your kernel code is constantly hammering on that location (e.g. setting it to true), the switch from managed to pinned could have a negative perf impact (although you are likely hitting a page fault in the other direction as well, based on what I see here). In that case I would seek to minimize the number of times you write to that location from the kernel, for example by keeping a local copy that you update as often as needed, and then updating the global location from that local copy once per threadblock, when that threadblock retires.
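Here is a minimal sketch of both suggestions, since I can't see your actual kernel. The names step_kernel and run_until_stable, and the "decrement every positive element" work, are hypothetical stand-ins for your code. The flag is allocated as mapped pinned memory with cudaHostAlloc, each block tracks changes in a shared-memory copy, and only one thread per block writes to the host-resident flag.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real one: decrement every positive
// element and record whether this block changed anything.
__global__ void step_kernel(int *data, int n, bool *there_were_changes)
{
    __shared__ int block_changed;          // per-block local copy of the flag
    if (threadIdx.x == 0) block_changed = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0) {
        data[i] -= 1;
        atomicOr(&block_changed, 1);       // cheap shared-memory update
    }
    __syncthreads();

    // At most one write per block to the pinned (host-resident) flag.
    if (threadIdx.x == 0 && block_changed) *there_were_changes = true;
}

// Repeats the kernel until a pass makes no changes. Assumes n > 0.
// Returns the number of passes launched, or -1 on a CUDA error.
int run_until_stable(int *d_data, int n)
{
    bool *h_flag = nullptr;  // host pointer to the pinned flag
    bool *d_flag = nullptr;  // device-side alias of the same location
    if (cudaHostAlloc((void **)&h_flag, sizeof(bool), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    if (cudaHostGetDevicePointer((void **)&d_flag, h_flag, 0) != cudaSuccess) {
        cudaFreeHost(h_flag);
        return -1;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    int passes = 0;
    do {
        *h_flag = false;                   // plain host store, no page migration
        step_kernel<<<blocks, threads>>>(d_data, n, d_flag);
        if (cudaDeviceSynchronize() != cudaSuccess) { passes = -1; break; }
        ++passes;
    } while (*h_flag);                     // plain host load after the sync

    cudaFreeHost(h_flag);
    return passes;
}
```

The host-side read of the flag after cudaDeviceSynchronize() is then an ordinary load from pinned host memory rather than an access to a managed page that may have migrated to the GPU. Whether this is a net win for you depends on how often your real kernel sets the flag, so I would profile both versions.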