Hi all,
I’m learning how to implement reductions in CUDA and found Mark Harris’ slides: https://developer.download.nvidia.cn/assets/cuda/files/reduction.pdf
They are a really good tutorial on the practical implementation of reduction, but they don’t mention threadFenceReduction.
I ran the reduction sample in cuda/samples/6_Advanced/reduction and got the following results:
./reduction --kernel=6 -n=33554432 --threads=128 --type=float
./reduction Starting...
GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2
Using Device 0: NVIDIA Tegra X2
Reducing array of type float
33554432 elements
128 threads (max)
64 blocks
Reduction, Throughput = 18.2663 GB/s, Time = 0.00735 s, Size = 33554432 Elements, NumDevsUsed = 1, Workgroup = 128
GPU result = 1.99240136
CPU result = 1.99240136
Test passed
I also ran cuda/samples/6_Advanced/threadFenceReduction:
./threadFenceReduction -n=33554432
threadFenceReduction Starting...
GPU Device 0: "NVIDIA Tegra X2" with compute capability 6.2
GPU Device supports SM 6.2 compute capability
33554432 elements
128 threads (max)
64 blocks
Average time: 3.398510 ms
Bandwidth: 39.493107 GB/s
GPU result = 1.992401361465
CPU result = 1.992401361465
threadFenceReduction is roughly twice as fast as kernel 6 of the reduction sample (3.40 ms vs. 7.35 ms, or 39.5 GB/s vs. 18.3 GB/s). Why? I read the code for both, but I can’t see any difference that would explain such a large performance gain. What makes the difference?
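To rule out the two samples simply reporting different metrics, I did a quick sanity check on the numbers above. This is only a sketch: it assumes 4 bytes per float, one read of every input element, and 1 GB = 1e9 bytes. Those are my assumptions, not something I confirmed in the sample source.

```python
# Sanity check of the throughput figures reported by the two samples.
# Assumptions (mine, not taken from the sample source): each float is
# 4 bytes, every element is read once, and 1 GB = 1e9 bytes.

N_ELEMENTS = 33554432
BYTES_PER_FLOAT = 4


def bandwidth_gb_per_s(n_elements, seconds):
    """Effective read bandwidth in GB/s for one pass over the input."""
    return n_elements * BYTES_PER_FLOAT / seconds / 1e9


reduction_time_s = 0.00735    # "Time = 0.00735 s" from reduction --kernel=6
fence_time_s = 3.398510e-3    # "Average time: 3.398510 ms" from threadFenceReduction

reduction_bw = bandwidth_gb_per_s(N_ELEMENTS, reduction_time_s)
fence_bw = bandwidth_gb_per_s(N_ELEMENTS, fence_time_s)
speedup = reduction_time_s / fence_time_s

print(f"reduction kernel 6:   {reduction_bw:.2f} GB/s")
print(f"threadFenceReduction: {fence_bw:.2f} GB/s")
print(f"speedup:              {speedup:.2f}x")
```

Under those assumptions both results land within rounding of the values the samples printed, so the two programs seem to measure throughput the same way and the gap looks like a real difference in run time rather than an accounting difference.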
Thanks!