Hello,

I have been trying to accelerate an algorithm that has large thread divergence. The simplified version of the kernel is as follows:

```
for each pixel in parallel:
    get list of neighboring nonzero pixels
    while( pixel list size is above a threshold )
        do calculations
        do serial (non-associative and non-commutative) calculations
        do more calculations
        if( current points are a good fit )
            save results to output and break
        else
            remove some (at least 1) outliers
```
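In CUDA terms, my first version is structured roughly like this (all names below, including `THRESHOLD` and `fit_kernel`, are placeholders rather than my real code):

```
#define THRESHOLD 8  // placeholder value

__global__ void fit_kernel(const float* pixels, float* output, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Gather this pixel's nonzero neighbors into a local list (placeholder).
    int list_size = /* count of nonzero neighbors */ 0;

    while (list_size > THRESHOLD) {   // can run 1 .. n - THRESHOLD times
        // do calculations
        // do serial (non-associative, non-commutative) calculations
        // do more calculations
        if (/* current points are a good fit */ true) {
            // save results to output[...]
            break;
        }
        list_size -= /* number of outliers removed, at least 1 */ 1;
    }
}
```

Threads that break out early simply idle until every thread in their warp finishes its own loop.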

The while loop can iterate anywhere from 1 to n - threshold times, which causes huge thread divergence within each block: threads that finish early must wait for the slowest thread in the block. Is there any well-known mechanism that could help alleviate this divergence? I have tried lowering the block size to 32 in the hope of minimizing it, but the GPU code is still slower than the serial implementation.
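One restructuring I have been wondering about, in case it is the standard remedy here, is hoisting the while loop onto the host and compacting the list of unfinished pixels between kernel launches, so every launch starts with full warps even though pixels converge at different times. A rough sketch of what I mean, using Thrust for the compaction step (`iterate_once`, `active`, and `flags` are made-up names, and the loop body is elided):

```
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/functional.h>

// Performs ONE iteration of the while-loop body for every still-active pixel.
// flags[i] is set to 1 if pixel active[i] still needs further iterations.
__global__ void iterate_once(const int* active, int num_active, int* flags
                             /* , ...pixel data... */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_active) return;
    int pixel = active[i];   // index of the pixel this thread works on

    // ...do calculations / serial part / more calculations for `pixel`...
    bool good_fit = /* current points a good fit? */ false;

    if (good_fit) {
        // save results to output; this pixel is done
        flags[i] = 0;
    } else {
        // remove some (at least 1) outliers from this pixel's list
        flags[i] = 1;
    }
}

void run_all_pixels(thrust::device_vector<int>& active /* , ... */)
{
    thrust::device_vector<int> flags(active.size());
    thrust::device_vector<int> next(active.size());

    int n = (int)active.size();
    while (n > 0) {
        int blocks = (n + 255) / 256;
        iterate_once<<<blocks, 256>>>(thrust::raw_pointer_cast(active.data()), n,
                                      thrust::raw_pointer_cast(flags.data()));

        // Keep only the pixels whose flag is still set; survivors are packed
        // densely, so the next launch has no idle threads.
        auto end = thrust::copy_if(active.begin(), active.begin() + n,
                                   flags.begin(), next.begin(),
                                   thrust::identity<int>());
        n = (int)(end - next.begin());
        active.swap(next);
    }
}
```

I am not sure whether the extra launch and compaction overhead would pay off here, since the per-iteration work per pixel is fairly small.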

I also tried parallelizing further so that the calculations are done in parallel within each block (each block handling a single pixel of the outer 'for each' loop): each thread takes a slice of the neighboring pixel list, and the partial results are combined with frequent reduce operations. This version is slightly slower than the first implementation, probably because of the section of code that must remain serial, as well as all the synchronization required by the many reductions.
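For context, the reductions in that second version currently use shared memory with a `__syncthreads()` per step. One change I have been considering is doing them with warp shuffle intrinsics instead, which need no block-wide sync within a warp, leaving only a single `__syncthreads()` per reduction. A sketch of a block-wide sum done that way (assuming `blockDim.x` is a multiple of 32 and at most 1024; names are placeholders):

```
__inline__ __device__ float warp_sum(float v)
{
    // Reduce within a warp using register shuffles; no shared memory, no syncs.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    return v;   // lane 0 of the warp holds the warp's sum
}

__inline__ __device__ float block_sum(float v)
{
    __shared__ float partial[32];   // one slot per warp (max 32 warps/block)
    int lane = threadIdx.x & 31;
    int warp = threadIdx.x >> 5;

    v = warp_sum(v);
    if (lane == 0) partial[warp] = v;
    __syncthreads();                // the only block-wide sync needed

    // The first warp reduces the per-warp partial sums.
    int nwarps = blockDim.x >> 5;
    v = (threadIdx.x < nwarps) ? partial[lane] : 0.0f;
    if (warp == 0) v = warp_sum(v);
    return v;                       // valid in thread 0 of the block
}
```

I do not know whether the sync overhead is actually the bottleneck, though, versus the serial section itself.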

Any pointers in the right direction would be greatly appreciated!