For example,
on the NVIDIA Volta architecture only 32 warp shuffle operations
can be performed within a clock cycle per SM.
For example, consider performing a warp-level Reduction256:
the warp-level reduction shown in Listing 2 requires 8 iterations
of 32-element reduction to reduce each segment. The total number of cycles
is therefore 256, since each shuffle instruction and addition takes
4 cycles.
The passage above is quoted from the paper Accelerating Reduction and Scan Using Tensor Core Units.
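To make the question concrete, here is how I currently picture the kernel the passage describes. This is only my own sketch of what Listing 2 might look like (the names `reduction256` and `warpReduce32`, and the indexing, are my assumptions, not code from the paper): each warp owns a 256-element segment and reduces it with 8 back-to-back 32-element shuffle reductions.

```cuda
#include <cuda_runtime.h>

// My reconstruction of the shuffle-based reduction -- NOT the paper's actual
// Listing 2. Kernel/helper names and indexing are my own guesses.

// Reduce 32 values (one per lane) with log2(32) = 5 shuffle + add steps.
__device__ float warpReduce32(float val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the sum of all 32 lanes
}

// Reduction256: each warp reduces one 256-element segment via
// 8 iterations of the 32-element warp reduction above.
// Assumes blockDim.x is a multiple of 32 and n is a multiple of 256.
__global__ void reduction256(const float* in, float* out, int n) {
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x % 32;
    int seg    = warpId * 256;            // segment owned by this warp
    if (seg + 256 > n) return;

    float segSum = 0.0f;
    for (int i = 0; i < 8; ++i) {         // 8 iterations per segment
        float v = in[seg + i * 32 + lane];
        float partial = warpReduce32(v);  // 5 shuffles + 5 adds per iteration
        if (lane == 0) segSum += partial; // accumulate the 8 partial sums
    }
    if (lane == 0) out[warpId] = segSum;
}
```

With this reading I naively count 8 iterations × 5 shuffle/add steps, and at 4 cycles per shuffle plus 4 cycles per add that comes to 8 × 5 × (4 + 4) = 320 cycles rather than 256, so either my picture of Listing 2 or my accounting must be off.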
My confusion:
- Why does the total cycle count for this reduction equal 256?
- How are the cycles accounted for, given the 8 iterations per warp and the architectural constraint on shuffle operations on Volta (32 per clock cycle per SM)?
- How do I calculate the theoretical cycle count for Tensor Core and shuffle operations when multiple warps are running simultaneously?
Could someone provide a detailed breakdown of how these 256 cycles are derived?