Performance of Divergent Threads

Most of the programming focus here is on many parallel threads grouped into blocks that are handed off to the multiprocessors as needed. However, the programming literature (and example projects) talks about the performance hit from divergent threads inside a warp.

Unfortunately, there are plenty of algorithms that could easily be parallelized but don’t fit into the standard CUDA model (above). My question is, given an algorithm like this, could one still make use of GPU programming by writing a program that runs in parallel on all of the multiprocessors but only runs one thread per processor? I think that would solve the divergence problem, no? If not, which would be faster – a single thread per processor, or just implementing the ‘divergent’ algorithm and living with the performance hits?

The reason I ask is that I was thinking about trying to implement some searching/sorting algorithms in CUDA, but since they need to do comparisons, there will be divergent branches all over the place when they run. (A minimal sketch of the kind of branch I mean is below.)
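For illustration only, here is a minimal sketch of the sort of data-dependent comparison such a kernel would contain; the kernel name and the threshold parameter are my own invented examples, not from anyone's actual implementation:

// Sketch: the branch depends on the data each thread reads, so threads in
// the same warp may take different paths and the warp executes both paths
// serially (divergence). Names here are purely illustrative.
__global__ void countAboveThreshold(const int *in, int *flags, int n, int threshold)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] > threshold) {   // data-dependent comparison -> possible divergence
            flags[i] = 1;
        } else {
            flags[i] = 0;
        }
    }
}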

A completely divergent warp will not be slower than a single thread per MP. It would probably be quite a bit faster, especially once you take some real-world considerations into account (e.g., several warps per MP, divergence that isn’t 100%, etc.).
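To make the comparison concrete, here is a hedged sketch of the two launch shapes being contrasted; the kernel, array size, and block size are assumptions of mine, not anything from the original discussion:

#include <cuda_runtime.h>

// Trivial placeholder kernel, just to show the launch configurations.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 14;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Full blocks: every warp carries 32 active threads, even if some diverge.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(d_data, n);

    // "One thread per processor" style: one thread per block leaves 31 of the
    // 32 lanes in each warp idle, which is why it is expected to be slower
    // than living with some divergence in fully populated warps.
    scaleKernel<<<n, 1>>>(d_data, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}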

The waters get even muddier when you start thinking about idle compute resources in memory-bound applications. If the MP is just sitting around waiting for memory reads to complete, you may be able to get the divergence “for free”. Still, I don’t pretend to understand or be able to explain it all, but I can offer a few words of advice:

Optimize for divergent warps last, and never avoid attempting a piece of code simply because you fear that divergent warps will hurt performance. I know of many algorithms implemented in CUDA in the scientific community with a high number of divergent warps that still perform extremely well (speedups of 20 to 100x).