I see in section 6.1.1.2 of the CUDA manual it is mentioned that control flow statements like “if” can significantly reduce throughput if they cause different threads within a warp to take divergent execution paths.
I’d like to understand this issue a little better. Is the control flow problem that each processor within the block might not be executing the same instruction at the same time? Or is the problem that memory reads might not be issued at the same time across the warp and therefore not be coalesced into a larger read?
Specifically, I have a kernel in mind which takes a large array of size N = 1 million, exponentiates the arguments, and sums them together. (My application isn’t exactly that, but this is a simplified approximation to the problem.) In my current implementation, each thread in a block accumulates the sum of 1024 of these exponentials into shared memory; thread 0 in the block then sums the per-thread results from shared memory and writes one value to global memory; finally, the CPU reads back the handful of values in global memory and computes the final sum.
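To make the question concrete, here is roughly what I mean — a sketch, not my actual code; the kernel name, the block size of 256, and the flat indexing are placeholder assumptions:

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define PER_THREAD 1024   // exponentials summed by each thread
#define BLOCK_SIZE 256    // assumed threads per block (placeholder)

__global__ void expSumKernel(const float *in, float *blockSums)
{
    __shared__ float partial[BLOCK_SIZE];  // one slot per thread

    // Each thread sums the exponentials of its own chunk of 1024 values.
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * PER_THREAD;
    float sum = 0.0f;
    for (int i = 0; i < PER_THREAD; ++i)
        sum += expf(in[base + i]);

    partial[threadIdx.x] = sum;
    __syncthreads();

    // Thread 0 serially adds the per-thread partials and writes a single
    // value per block; the CPU then sums the handful of block results.
    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            total += partial[i];
        blockSums[blockIdx.x] = total;
    }
}
```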
Since computing the exponential takes many instructions, a simple optimization would be to skip computing the exponential if the argument is less than -50, since in my particular usage, that will contribute essentially nothing to the final answer. The problem with this optimization is that the arguments which are being exponentiated are not in any particular order in memory, so each thread will get out of sync as some threads skip exponentials while others do not. However, the probability of skipping an exponential is roughly uniform, so after 1024 sums, each thread will still reach the __syncthreads() call at nearly the same time.
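The change I’m considering would look something like the inner loop below (again a sketch; `sumExp`, `args`, and `count` are placeholders for each thread’s slice of the input, and -50 is the cutoff from my usage):

```cuda
#include <math.h>

// Per-thread inner loop with the proposed early-out. Because the input
// values are in no particular order, threads in the same warp may take
// different sides of this branch on any given iteration -- this is the
// divergence whose cost I'm asking about.
__device__ float sumExp(const float *args, int count)
{
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) {
        if (args[i] >= -50.0f)   // skip terms that contribute ~nothing
            sum += expf(args[i]);
    }
    return sum;
}
```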
Will this have a disastrous effect on throughput, and why? (My GPU will arrive soon and I can test myself, but I’m hoping an expert can help me understand the architecture a little better.)