Hello, I was wondering whether there is any discernible performance difference between having diverging threads within a warp and having coherent warps that diverge from other warps in the same block, and how either case compares with having no divergence between any warps.
As I understand it, warps get executed in 1 clock, so having all threads in a warp execute the same instruction is very important for performance. I do not fully understand inter-warp performance and its impact when warps diverge from other warps (When CUDA executes multiple warps, do they also execute all warps in a block in one clock?).
If the divergence is significant (i.e. prevalent), I would expect that there is definitely a difference between intra-warp divergence (potentially problematic) and inter-warp divergence (generally not much of a concern, AFAIK). In fact, I think it would be common to say that inter-warp divergence is not really divergence at all, and it doesn’t get talked about much (that I know of) because it is generally the expected state of affairs. Warps, even those doing identical work, are expected to drift apart, at least somewhat, over time.
I wouldn’t say warps execute in one clock. I’m not really sure what you mean by that. I would say that the desirable state for the threads of a warp is to be executing in lockstep, and, without exception, instruction issue in the GPU occurs warp-wide. I’m not sure there are any instructions that execute in one clock (there may be); many GPU instruction handling units are pipelined, so an instruction generally needs at least as many clocks as the pipeline is deep to complete.
Generally speaking, the warps in a block do not all execute in one clock, nor do they execute in lockstep with each other. There aren’t any architectures that could issue a full threadblock’s complement of warps (up to 32, i.e. 1024 threads) in a single cycle. And, of course, most SM architectures actually support up to 64 resident warps per SM. So inter-warp divergence is normal. Even high-performing codes may exhibit significant inter-warp divergence. Again, this is so “normal” that in common parlance it does not get labelled as divergence at all, that I know of.
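To make the distinction concrete, here is a minimal sketch, not taken from any particular code base. The names `pathA`, `pathB`, `branchPerLane`, `branchPerWarp` and the host helper `runBranchDemo` are all hypothetical, and the sketch assumes 1D blocks whose size is a multiple of 32 and a non-empty input. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-ins for two different code paths.
__device__ float pathA(float x) { return x * 2.0f + 1.0f; }
__device__ float pathB(float x) { return x * 0.5f - 1.0f; }

// Intra-warp divergence: even and odd lanes of the SAME warp disagree on the
// branch, so the warp has to work through both paths with lanes masked off.
__global__ void branchPerLane(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) out[i] = pathA(in[i]);
    else                      out[i] = pathB(in[i]);
}

// Warp-uniform branch: all 32 lanes of a warp agree on the condition, so each
// warp follows a single path. Different warps simply run different code.
__global__ void branchPerWarp(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int warpInBlock = threadIdx.x / warpSize;
    if (warpInBlock % 2 == 0) out[i] = pathA(in[i]);
    else                      out[i] = pathB(in[i]);
}

// Hypothetical host helper: runs one of the two kernels and returns the output.
// Assumes `in` is not empty.
std::vector<float> runBranchDemo(bool perWarp, const std::vector<float>& in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 64;                          // two warps per block
    int blocks = (n + threads - 1) / threads;
    if (perWarp) branchPerWarp<<<blocks, threads>>>(dIn, dOut, n);
    else         branchPerLane<<<blocks, threads>>>(dIn, dOut, n);

    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Note that for branches as short as these the compiler may well emit predicated instructions rather than a real branch, so any timing difference between the two kernels would probably only show up with heavier code paths. The two kernels also assign the paths to different elements, so their outputs are not expected to match.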
I understand, thanks! I think I confused the concept of warp coherence (all threads in a warp executing the same instruction, so that the GPU can issue 1 instruction for the 32 threads of a warp) with how many clocks that instruction actually takes, which varies per instruction.
I am considering inter-warp synchronization as a way to deal with some nested for-loops in my kernel by unraveling the loops on a per-warp basis, and using shared memory to reduce the results, and I was concerned about how warp divergence and synchronizing warps would impact performance (whether this would be worthwhile or not).
To extend the question: inter-warp divergence is standard behavior, and my understanding of __syncthreads() is that it synchronizes all threads in a block. So if I wanted to implement inter-warp synchronization where each warp becomes one ‘iteration’ of the for loop I’m trying to unwrap, I would have to set the block and grid dimensions such that there are 32*n threads in each block, where n is the number of iterations in the loop I wish to unravel, and I would have to use __syncthreads() to synchronize each block’s warps in order to perform the reduction step via shared memory. There would also be some performance impact from warps that have drifted apart having to wait at the synchronization call. Is this correct?
As a side note, I am also aware that the most recent Volta architecture comes with some new inter-thread scheduling that “enables intra-warp synchronization patterns previously unavailable”, which makes things interesting.
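As a minimal sketch of what an intra-warp synchronization pattern can look like, assuming CUDA 9 or later (where `__syncwarp()` is available): the lanes of one warp write to shared memory from divergent branches, synchronize at warp level, and then each lane reads its neighbor's value. The kernel name `neighborExchange` and the host helper `runNeighborExchange` are hypothetical, and the sketch assumes a single block of exactly 32 threads. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each lane publishes a value from a divergent branch, then reads the value
// written by the next lane of the same warp. Assumes blockDim.x == 32.
__global__ void neighborExchange(const int* in, int* out)
{
    __shared__ int slot[32];
    unsigned lane = threadIdx.x;

    if (lane % 2 == 0) slot[lane] = in[lane];    // even lanes
    else               slot[lane] = -in[lane];   // odd lanes

    // Warp-level barrier: all 32 lanes arrive here, and the shared-memory
    // writes above are ordered before the reads below.
    __syncwarp();

    out[lane] = slot[(lane + 1) % 32];
}

// Hypothetical host helper. Assumes in.size() == 32.
std::vector<int> runNeighborExchange(const std::vector<int>& in)
{
    std::vector<int> out(32);
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, 32 * sizeof(int));
    cudaMalloc(&dOut, 32 * sizeof(int));
    cudaMemcpy(dIn, in.data(), 32 * sizeof(int), cudaMemcpyHostToDevice);

    neighborExchange<<<1, 32>>>(dIn, dOut);

    cudaMemcpy(out.data(), dOut, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Without the `__syncwarp()` call, code like this would be relying on the lanes implicitly running in lockstep, which is exactly the assumption that independent thread scheduling removes.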
That’s generally true for shared-memory parallel reductions, yes.
Also generally true.
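A minimal sketch of the pattern being discussed, under stated assumptions: the loop has a fixed, power-of-two trip count (`kIters`, hypothetical), the loop body is the made-up function `body`, and each block of 32 * `kIters` threads handles 32 output elements, with warp w computing iteration w. The kernel name `loopAcrossWarps` and the host helper `runLoopAcrossWarps` are also hypothetical, the input is assumed non-empty, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kWarpSize = 32;
constexpr int kIters = 8;   // hypothetical loop trip count n; power of two assumed

// Hypothetical loop body: contribution of iteration k to element value x.
__device__ float body(float x, int k) { return x * static_cast<float>(k + 1); }

// Launched with 32 * kIters threads per block: warp w evaluates iteration w
// for the block's 32 elements, then the warps' partial results are summed.
__global__ void loopAcrossWarps(const float* in, float* out, int n)
{
    __shared__ float partial[kIters][kWarpSize];
    int lane = threadIdx.x % kWarpSize;
    int warp = threadIdx.x / kWarpSize;
    int i = blockIdx.x * kWarpSize + lane;

    partial[warp][lane] = (i < n) ? body(in[i], warp) : 0.0f;
    __syncthreads();   // wait until every warp has stored its iteration

    // Tree reduction across warps. The branch is uniform within each warp.
    for (int stride = kIters / 2; stride > 0; stride /= 2) {
        if (warp < stride)
            partial[warp][lane] += partial[warp + stride][lane];
        __syncthreads();   // reached by all threads, including idle warps
    }

    if (warp == 0 && i < n) out[i] = partial[0][lane];
}

// Hypothetical host helper. Assumes `in` is not empty.
std::vector<float> runLoopAcrossWarps(const std::vector<float>& in)
{
    int n = static_cast<int>(in.size());
    std::vector<float> out(n);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + kWarpSize - 1) / kWarpSize;
    loopAcrossWarps<<<blocks, kWarpSize * kIters>>>(dIn, dOut, n);

    cudaMemcpy(out.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Every `__syncthreads()` sits outside the conditional so that all threads of the block reach it. Warps that finish their iteration early wait at the barrier, which is the synchronization cost mentioned above. Whether this beats a plain per-thread loop depends on how expensive the loop body is, so it would need to be measured.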