I’m learning how to implement reduction in CUDA and found Mark Harris’ slides: https://developer.download.nvidia.cn/assets/cuda/files/reduction.pdf
This is a really good tutorial on the practical implementation of reduction, but Mark didn’t mention threadFenceReduction in the slides.
I ran the reduction sample in cuda/samples/6_Advanced/reduction and got the following results:
./reduction --kernel=6 -n=33554432 --threads=128 --type=float
./reduction Starting...
GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2
Using Device 0: NVIDIA Tegra X2
Reducing array of type float
33554432 elements
128 threads (max)
64 blocks
Reduction, Throughput = 18.2663 GB/s, Time = 0.00735 s, Size = 33554432 Elements, NumDevsUsed = 1, Workgroup = 128
GPU result = 1.99240136
CPU result = 1.99240136
Test passed
And I also tried cuda/samples/6_Advanced/threadFenceReduction:
./threadFenceReduction -n=33554432
threadFenceReduction Starting...
GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2
GPU Device supports SM 6.2 compute capability
33554432 elements
128 threads (max)
64 blocks
Average time: 3.398510 ms
Bandwidth: 39.493107 GB/s
GPU result = 1.992401361465
CPU result = 1.992401361465
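threadFenceReduction reaches roughly twice the throughput of reduction kernel 6 (39.49 GB/s vs. 18.27 GB/s, with the same 33554432 elements, 128 threads and 64 blocks). Why? I read the code of both samples, but I can’t see a difference that would explain such a large performance gain. What makes the difference?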
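To rule out the two samples simply reporting bandwidth differently, here is a quick arithmetic check in Python. It assumes both samples compute bandwidth as input bytes read divided by elapsed time, with 4-byte floats and 1 GB = 10^9 bytes. That is my assumption, not something I confirmed in the sample sources. The element count and the two times are copied from the output above; the function and variable names are just mine.

```python
# Assumption: bandwidth = (elements * sizeof(float)) / elapsed_seconds, 1 GB = 1e9 bytes.
N_ELEMENTS = 33554432      # from the -n=33554432 argument in both runs
BYTES_PER_FLOAT = 4        # --type=float


def bandwidth_gb_s(elapsed_seconds):
    """Effective read bandwidth in GB/s for one pass over the input array."""
    return N_ELEMENTS * BYTES_PER_FLOAT / elapsed_seconds / 1e9


# Times as printed by the two samples.
reduction_time_s = 0.00735            # reduction --kernel=6, "Time = 0.00735 s"
threadfence_time_s = 3.398510e-3      # threadFenceReduction, "Average time: 3.398510 ms"

reduction_gb_s = bandwidth_gb_s(reduction_time_s)
threadfence_gb_s = bandwidth_gb_s(threadfence_time_s)
speedup = reduction_time_s / threadfence_time_s

print(f"reduction kernel 6:   {reduction_gb_s:.2f} GB/s")
print(f"threadFenceReduction: {threadfence_gb_s:.2f} GB/s")
print(f"time ratio:           {speedup:.2f}x")
```

Under that assumption the recomputed values come out at about 18.26 GB/s and 39.49 GB/s, which match the printed figures up to the rounding of the reported time. So both samples seem to measure the same quantity, and the roughly 2.2x gap is in the measured time itself rather than in how bandwidth is reported.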