Most of the programming focus here is on launching many parallel threads, grouped into blocks, that get handed off to the multiprocessors as needed. However, the programming literature (and the example projects) warns about the performance hit from divergent threads inside a warp.
Unfortunately, there are plenty of algorithms that could easily be parallelized but don’t fit the standard CUDA model (above). My question is: given an algorithm like this, could one still make use of GPU programming by writing a program that runs in parallel on all of the multiprocessors, but only runs one thread per processor? I think that would solve the divergence problem, no? If not, which would be faster: one thread per processor, or just implementing the ‘divergent’ algorithm and living with the performance hit?
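To make the idea concrete, here is roughly what I have in mind (an untested sketch; the kernel, the min-of-a-chunk workload, and the chunk size are all made up, and I'm assuming I can get one block per multiprocessor via multiProcessorCount):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical kernel: one thread per block, so each multiprocessor runs
// an independent sequential search and there are no sibling threads in
// the warp to diverge from.
__global__ void minPerChunk(const int *in, int *out, int chunkSize)
{
    int base = blockIdx.x * chunkSize;   // one chunk per block
    int best = in[base];
    for (int i = 1; i < chunkSize; ++i)
        if (in[base + i] < best)
            best = in[base + i];
    out[blockIdx.x] = best;
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int numSMs = prop.multiProcessorCount;   // one block per multiprocessor
    const int chunkSize = 1024;              // arbitrary for this sketch
    const int n = numSMs * chunkSize;

    int *h_in = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i) h_in[i] = rand();

    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, numSMs * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // One block per multiprocessor, a single thread per block.
    minPerChunk<<<numSMs, 1>>>(d_in, d_out, chunkSize);
    cudaDeviceSynchronize();

    int *h_out = (int *)malloc(numSMs * sizeof(int));
    cudaMemcpy(h_out, d_out, numSMs * sizeof(int), cudaMemcpyDeviceToHost);
    printf("chunk 0 min = %d\n", h_out[0]);

    free(h_in); free(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```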
The reason I ask is that I've been thinking about implementing some searching/sorting algorithms in CUDA, but since they need to do comparisons, there will be divergent branches all over the place when they run.
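For example, even a single compare-and-swap step, which is the core of most sorting approaches, has a data-dependent branch, so threads in the same warp can take different paths depending on the values they see. A rough sketch of the kind of divergence I mean (one even phase of an odd-even transposition sort; illustrative only):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Each thread compares one even-indexed element with its right neighbour
// and swaps if out of order. The `if` is data-dependent, so threads in
// the same warp can diverge -- exactly the problem I'm worried about.
__global__ void compareSwapEven(int *data, int n)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 1 < n && data[i] > data[i + 1]) {   // divergent branch
        int tmp     = data[i];
        data[i]     = data[i + 1];
        data[i + 1] = tmp;
    }
}

int main(void)
{
    const int n = 8;
    int h[8] = {7, 3, 5, 1, 8, 2, 6, 4};
    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    compareSwapEven<<<1, n / 2>>>(d, n);   // one thread per pair
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%d ", h[i]);
    printf("\n");
    cudaFree(d);
    return 0;
}
```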