Warp-level operation cost

Assume there is no warp divergence or other unusual condition. Can I treat __syncwarp() as a zero-cost operation? Can __shfl_sync() also be treated as a zero-cost operation?

__shfl_sync() has reduced throughput relative to many other instructions; how much lower depends on the architecture. See here.

Also, as far as I know, shfl uses some of the same port hardware as shared memory, so its cost depends on what other instructions are being executed at the same time.

This is generally true for CUDA: as long as you have a good instruction mix, you get good overall performance. With a monoculture of one instruction type you create a bottleneck, which may or may not be critical for your application.

So a few __shfl_sync() calls are no problem; if you have a huge number of them, they can become the limiting factor.
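To make "a few shuffles" concrete, here is a minimal sketch of a warp-wide sum that needs only five shuffle instructions per thread. It assumes a launch with exactly one full warp; the names warp_sum, warp_sum_kernel and run_warp_sum are made up for this illustration and are not from the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sum the values held by the 32 lanes of one warp using shuffles only.
// Assumes all 32 lanes are active (full mask).
__device__ int warp_sum(int v) {
    // Offsets 16, 8, 4, 2, 1: five shuffle instructions in total.
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    return v;  // only lane 0 is guaranteed to hold the full sum
}

__global__ void warp_sum_kernel(int *out) {
    int lane = threadIdx.x;      // kernel is launched with exactly one warp
    int total = warp_sum(lane);  // adds up the lane indices 0..31
    if (lane == 0) *out = total;
}

// Hypothetical host helper: runs the kernel, returns the sum or -1 on error.
int run_warp_sum() {
    int *d_out = nullptr;
    int h_out = -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return -1;
    warp_sum_kernel<<<1, 32>>>(d_out);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out : -1;
}

int main() {
    printf("warp sum = %d\n", run_warp_sum());
    return 0;
}
```

Five shuffles mixed in with loads and arithmetic will rarely matter. A kernel whose inner loop is almost entirely shuffles is the monoculture case described above.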

__syncwarp() is cheap and fast, but it is still an instruction that has to be issued. So for a very fast program it could become a limiting factor.

Coupled with these other comments: when a sync operation is necessary for correctness, there is often no alternative. So the cost, whatever it may be, has to be paid.
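As an illustration of a sync that is needed for correctness, here is a small sketch in which each lane writes to shared memory and then reads its neighbour's slot. Without the __syncwarp() between the write and the read, a lane is not guaranteed to see the neighbour's value. The kernel and helper names (rotate_kernel, run_rotate_check) are hypothetical, and a single-warp launch is assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void rotate_kernel(int *out) {
    __shared__ int buf[32];
    int lane = threadIdx.x;  // kernel is launched with exactly one warp

    buf[lane] = lane * 10;   // each lane writes its own slot

    // Needed for correctness: orders the writes above before the reads
    // below for all lanes named in the mask (default mask: all 32 lanes).
    __syncwarp();

    out[lane] = buf[(lane + 1) % 32];  // read the neighbouring lane's slot
}

// Hypothetical host helper: true if every lane saw its neighbour's value.
bool run_rotate_check() {
    int *d_out = nullptr;
    int h_out[32];
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) return false;
    rotate_kernel<<<1, 32>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;
    for (int lane = 0; lane < 32; ++lane) {
        if (h_out[lane] != ((lane + 1) % 32) * 10) return false;
    }
    return true;
}

int main() {
    printf("neighbour exchange %s\n", run_rotate_check() ? "ok" : "FAILED");
    return 0;
}
```

For this particular pattern, a single __shfl_sync() could replace the shared-memory round trip and the sync together. Which is cheaper depends on the surrounding instruction mix.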

Thanks!
