Optimization and synchronization with CUDA


I have code with the following structure, which I optimized with CUDA.

while (true) {
    if (convergenceFlag) break;
    kernel1<<<grid, block>>>(/* ... */);  // includes a minimum reduction
    kernel2<<<grid, block>>>(/* ... */);
    kernel3<<<grid, block>>>(/* ... */);
}

Each kernel performs a single loop (from 1 to N) and must finish completely before the next one runs. In addition, kernel1 performs a minimum reduction.
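As a rough sketch of what kernel1 might look like (the array name `a`, the output buffer `blockMin`, and the launch configuration are my assumptions, not the original code), a block-local minimum reduction in shared memory is:

```cuda
#include <cfloat>

// Sketch of kernel1 (names and parameters are assumptions): each thread loads
// one element, then the block reduces them to a minimum in shared memory.
// Assumes blockDim.x is a power of two.
__global__ void kernel1(const float *a, float *blockMin, int N) {
    extern __shared__ float sdata[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < N) ? a[i] : FLT_MAX;
    __syncthreads();  // block-local barrier: threads in OTHER blocks are not synchronized
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] = fminf(sdata[threadIdx.x], sdata[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockMin[blockIdx.x] = sdata[0];  // one partial minimum per block
}
```

With more than one block, this produces only per-block partial minima, which is exactly why the barrier inside the kernel is not a global synchronization.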

Since `__syncthreads()` synchronizes only within a CUDA block, the synchronization I added in kernel1 does not work globally, but the final results are acceptable even with only block-local synchronization. The final execution times are good, but profiling showed that kernel-launch overhead accounts for about 60% of the runtime, since kernel1, kernel2, and kernel3 are launched millions of times in total.

I then tried restructuring the code into a single kernel launch, with the while loop inside the kernel. Because I wanted proper synchronization, I used only 1 block. However, the execution time increased far too much, about 36x higher, even though the final results were exactly correct.
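A minimal sketch of that single-launch variant (the kernel name, `maxIter`, and the phase bodies are placeholders of mine): with exactly one block, `__syncthreads()` acts as a barrier over every running thread, so the three phases can be separated inside one device-side loop:

```cuda
// Sketch of the one-launch, one-block variant (names are assumptions).
// Because the grid has a single block, __syncthreads() is effectively global.
__global__ void solverPersistent(float *a, int N, int maxIter) {
    __shared__ int converged;
    if (threadIdx.x == 0) converged = 0;
    __syncthreads();
    for (int iter = 0; iter < maxIter && !converged; ++iter) {
        // phase 1 (former kernel1, including the min reduction) ...
        __syncthreads();
        // phase 2 (former kernel2) ...
        __syncthreads();
        // phase 3 (former kernel3) ...
        if (threadIdx.x == 0) {
            // convergence test here; set converged = 1 when done
        }
        __syncthreads();  // make `converged` visible to all threads before the loop test
    }
}

// A single launch replaces millions of per-iteration launches:
//   solverPersistent<<<1, 256>>>(d_a, N, maxIter);
```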

I wrote this last version in two ways: 1) each thread accesses a contiguous range of the array, using a local loop whose bounds are computed from the thread ID (like default static scheduling in OpenMP), and 2) each thread accesses strided positions, incrementing its index by the number of threads in the block.
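The two access patterns can be sketched like this (the `work()` body and all names are placeholders of mine, not the original code):

```cuda
__device__ void work(float &x) { x += 1.0f; }  // placeholder for the real per-element update

// 1) "blocked": each thread owns a contiguous chunk, like OpenMP static scheduling.
__global__ void blockedLayout(float *a, int N) {
    int chunk = (N + blockDim.x - 1) / blockDim.x;
    int begin = threadIdx.x * chunk;
    int end   = min(begin + chunk, N);
    for (int i = begin; i < end; ++i) work(a[i]);
}

// 2) "strided": threads advance by blockDim.x, so at each step consecutive
// threads touch consecutive elements and the global accesses coalesce.
__global__ void stridedLayout(float *a, int N) {
    for (int i = threadIdx.x; i < N; i += blockDim.x) work(a[i]);
}
```

On GPUs the strided layout is normally the better choice, since memory coalescing across threads matters more than per-thread spatial locality.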

I think that using only 1 block causes memory contention among the threads. My question is whether it is possible to improve performance while using only 1 block.