Why is threadFenceReduction faster than reduction6 in the CUDA samples?

Hi all,
I’m learning about CUDA reduction implementations and I found Mark Harris’ slides: https://developer.download.nvidia.cn/assets/cuda/files/reduction.pdf
This is a really good tutorial on the practical implementation of reduction, but Mark doesn’t mention threadFenceReduction in the slides.
I ran the reduction sample in cuda/samples/6_Advanced/reduction and got the following results:

./reduction --kernel=6 -n=33554432  --threads=128 --type=float
./reduction Starting...

GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2

Using Device 0: NVIDIA Tegra X2

Reducing array of type float

33554432 elements
128 threads (max)
64 blocks

Reduction, Throughput = 18.2663 GB/s, Time = 0.00735 s, Size = 33554432 Elements, NumDevsUsed = 1, Workgroup = 128

GPU result = 1.99240136
CPU result = 1.99240136

Test passed

And I also tried cuda/samples/6_Advanced/threadFenceReduction:

./threadFenceReduction -n=33554432
threadFenceReduction Starting...

GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2

GPU Device supports SM 6.2 compute capability

33554432 elements
128 threads (max)
64 blocks
Average time: 3.398510 ms
Bandwidth:    39.493107 GB/s

GPU result = 1.992401361465
CPU result = 1.992401361465

threadFenceReduction is twice as fast as reduction6. Why? I read the code, but I can’t see a difference that would lead to such a huge performance improvement. What makes the difference?


One difference is that threadFenceReduction requires only one kernel launch, while reduction6 requires at least two.

You may want to ask these questions on the TX2 forum. When I run the comparison on a Tesla V100, I see only a small difference (~20%) between the two test cases.

This is also a matter of how many blocks you are using (which in reduction6 correlates with the amount of data being reduced).

threadFenceReduction trades a constant time saving (not having to launch an extra kernel) for extra work done in each block. So as the number of blocks grows, at some point running an extra kernel becomes cheaper.
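The single-pass pattern can be sketched as follows. This is a simplified, hedged illustration of the general "last block finishes" technique (names like `reduceSinglePass` and `retirementCount` are illustrative, not the sample's exact code): each block writes its partial sum to global memory, calls `__threadfence()` so the write is visible device-wide, then atomically increments a counter; the block that observes the final count reduces the partial sums itself, replacing the second kernel launch.

```cuda
#include <cuda_runtime.h>

// Global counter of blocks that have finished phase 1 (illustrative name).
__device__ unsigned int retirementCount = 0;

__global__ void reduceSinglePass(const float *in, float *partial,
                                 float *out, unsigned int n)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;

    // Phase 1: each block reduces its grid-strided slice, as in reduction6.
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];

    // Make this block's partial sum visible to all other blocks
    // before signalling completion.
    __threadfence();

    // Phase 2: the last block to retire reduces the per-block partial
    // sums itself, replacing reduction6's second kernel launch.
    __shared__ bool amLast;
    if (tid == 0) {
        unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
        amLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (amLast) {
        float total = 0.0f;
        for (unsigned int i = tid; i < gridDim.x; i += blockDim.x)
            total += partial[i];
        sdata[tid] = total;
        __syncthreads();
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) {
            *out = sdata[0];
            retirementCount = 0;  // reset for the next launch
        }
    }
}
```

Note that phase 2 runs in a single block, so it does strictly more serial work than a well-sized second kernel would; the win is saving one kernel launch, which matters most when the grid (and hence the partial-sum array) is small.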

OK, thank you Robert and tera.