For example,
on the NVIDIA Volta architecture only 32 warp shuffle operations
can be performed within a clock cycle per SM.
For example, consider performing a warp-level Reduction256:
the warp-level reduction shown in Listing 2 requires 8 iterations
of 32-element reduction to reduce each segment. The total number of cycles
is therefore 256, since each shuffle instruction and addition takes
4 cycles.
The passage above is quoted from the paper Accelerating Reduction and Scan Using Tensor Core Units.
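To make the question concrete, here is how I currently picture the kernel the passage describes. This is only my own sketch of what Listing 2 might look like (the names `reduction256` and `warpReduce32`, and the indexing, are my assumptions, not code from the paper): each warp owns a 256-element segment and reduces it with 8 back-to-back 32-element shuffle reductions.

```cuda
#include <cuda_runtime.h>

// My reconstruction of the shuffle-based reduction -- NOT the paper's actual
// Listing 2. Kernel/helper names and indexing are my own guesses.

// Reduce 32 values (one per lane) with log2(32) = 5 shuffle + add steps.
__device__ float warpReduce32(float val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the sum of all 32 lanes
}

// Reduction256: each warp reduces one 256-element segment via
// 8 iterations of the 32-element warp reduction above.
// Assumes blockDim.x is a multiple of 32 and n is a multiple of 256.
__global__ void reduction256(const float* in, float* out, int n) {
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x % 32;
    int seg    = warpId * 256;            // segment owned by this warp
    if (seg + 256 > n) return;

    float segSum = 0.0f;
    for (int i = 0; i < 8; ++i) {         // 8 iterations per segment
        float v = in[seg + i * 32 + lane];
        float partial = warpReduce32(v);  // 5 shuffles + 5 adds per iteration
        if (lane == 0) segSum += partial; // accumulate the 8 partial sums
    }
    if (lane == 0) out[warpId] = segSum;
}
```

With this reading I naively count 8 iterations × 5 shuffle/add steps, and at 4 cycles per shuffle plus 4 cycles per add that comes to 8 × 5 × (4 + 4) = 320 cycles rather than 256, so either my picture of Listing 2 or my accounting must be off.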
My confusion:
- Why does the total cycle count for this reduction equal 256?
- How are the cycles accounted for, given the 8 iterations per warp and the architectural constraint on shuffle operations on Volta (32 per clock cycle per SM)?
- How do I calculate the theoretical cycle count for Tensor Core and shuffle operations when multiple warps are running simultaneously?
Could someone provide a detailed breakdown of how these 256 cycles are derived?