Why does warp-level Reduction256 take 256 cycles on NVIDIA Volta architecture?

For example, on the NVIDIA Volta architecture only 32 warp shuffle operations can be performed within a clock cycle per SM.

For example, consider performing a warp-level Reduction256: the warp-level reduction shown in Listing 2 requires 8 iterations of a 32-element reduction to reduce each segment. The total cycle count is therefore 256, since each shuffle instruction and addition takes 4 cycles.
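For intuition, the shuffle-down tree that a single 32-element warp reduction performs can be simulated on the CPU. This is a minimal Python sketch assuming standard `__shfl_down_sync` semantics (it is not the paper's actual Listing 2, and `shfl_down`/`warp_reduce` are hypothetical helper names):

```python
# Simulate a 32-lane warp executing a shuffle-down tree reduction.
# Each "lane" holds one value; shfl_down(vals, delta) models
# __shfl_down_sync: lane i reads the value held by lane i + delta.

WARP_SIZE = 32

def shfl_down(vals, delta):
    # Lanes whose source index falls off the end of the warp keep
    # their own value, mirroring the out-of-range behavior in hardware.
    return [vals[i + delta] if i + delta < len(vals) else vals[i]
            for i in range(len(vals))]

def warp_reduce(vals):
    # log2(32) = 5 iterations: offsets 16, 8, 4, 2, 1.
    steps = 0
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [a + b for a, b in zip(vals, shifted)]  # one add per lane
        offset //= 2
        steps += 1
    return vals[0], steps  # lane 0 ends up holding the 32-element sum

total, steps = warp_reduce(list(range(32)))
print(total, steps)  # 496 (= sum(range(32))) in 5 shuffle+add steps
```

Each of the log2(32) = 5 loop iterations issues one shuffle and one add per lane; this is the per-32-element instruction stream that any cycle estimate for the reduction has to account for.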

The content above is from the paper "Accelerating Reduction and Scan Using Tensor Core Units".

My confusion:

  • Why does the total cycle count for this reduction equal 256?
  • How are the cycles accounted for, given the 8 iterations per warp and the architectural constraints on shuffle operations on Volta?
  • How do I calculate the theoretical cycle count for Tensor Core and shuffle operations when multiple warps are running simultaneously?

Could someone provide a detailed breakdown of how these 256 cycles are derived?

How does the Reduction256 work? Do you have 256 threads, or 32 threads with 8 values each?
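On the layout question: the paper's wording ("8 iterations of 32 element reduction") reads as a single 32-lane warp sweeping a 256-element segment in 8 chunks, i.e., 32 threads covering 8 values each. A hedged Python sketch of that reading (the helper name `tree_reduce` is hypothetical, not from the paper):

```python
# Sketch of one reading of Reduction256: a 32-lane warp reduces a
# 256-element segment as 8 consecutive 32-element chunks, running one
# shuffle-down tree per chunk. Illustration only -- not Listing 2.

WARP_SIZE = 32
SEGMENT = 256
data = list(range(SEGMENT))

def tree_reduce(chunk):
    # One shuffle-down tree over a 32-value chunk: at offset d, lane i
    # adds the value held by lane i + d (out-of-range lanes keep their own).
    vals = list(chunk)
    steps = 0
    d = len(vals) // 2
    while d > 0:
        vals = [vals[i] + (vals[i + d] if i + d < len(vals) else vals[i])
                for i in range(len(vals))]
        d //= 2
        steps += 1
    return vals[0], steps

total, shuffle_steps = 0, 0
for i in range(0, SEGMENT, WARP_SIZE):   # 8 iterations, one chunk each
    s, st = tree_reduce(data[i:i + WARP_SIZE])
    total += s
    shuffle_steps += st

print(total, shuffle_steps)  # 32640; 8 iterations x 5 shuffle+add steps = 40
```

Under this reading, the warp issues 8 × 5 shuffles (and as many adds) per 256-element segment; how the paper maps that instruction count onto exactly 256 cycles is precisely what the question above is asking.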


Yes! You are right!