Huge divergence with fitting algorithm


I have been trying to accelerate an algorithm that has large thread divergence. The simplified version of the kernel is as follows:

for each pixel in parallel...
  get list of neighboring nonzero pixels
  while( pixel list size is above a threshold )
    do calculations
    do serial (non-associative and non-communitive) calculations
    do more calculations
      if( current points are a good fit )
        save results to output and break
        remove some (at least 1) outliers

The while loop can loop potentially 1 to n - threshold times, which causes huge thread divergence in each block as completed threads must wait for the slowest thread in the block. Is there any well-known mechanism that could potentially help alleviate this divergence? I have tried lowering the block size to 32 hoping to minimize the divergence, but the GPU code is still slower than the serial implementation.

I also tried further parallelizing the algorithm so that the calculations are done in parallel within each block (so that each block represents a single pixel in the outer ‘for each’ loop) by giving threads a piece of the neighboring pixel list to perform calculations on, which performs reduce operations frequently. This version is a little slower than the first implementation, probably because of a section of the code that must be done serially, as well as all the syncs needed by the many reduce operations.

Any help with being pointed in the right direction is greatly appreciated!

Have each thread process multiple pixels. The threads that finish early will grab an additional pixel and re-enter the while loop.

Reduce your total thread count to ~10000 threads and use a grid-striding loop to process pixels. If you can pack enough pixels into a single thread, and there is any sense of a “mean” while loop count per pixel, then you should be able to “average out” the variations per pixel, resulting in less divergence.

This modification could still be subject to worst-case situations, e.g. where one thread consistently happens to pick up the pixels that are longest in processing duration. To solve that problem, it would help if you had some a-priori information about pixel processing times (while-loop extents). With that information, you could sort or otherwise re-arrange the pixel processing order to group pixels that have approximately similar processing times together.

Thank you, your advice to stick with around ~10000 threads was what I needed! I already had a grid-stride loop, so all I had to do was lower the grid size. With that single modification, the code ran 3 times faster. Doing this and lowering the dynamically allocated shared memory (using the implementation where the calculations are done in parallel) saw a 6 times speedup!

look into “persistent threads” approach and dynamic parallelism