I am trying to implement a fast reduction in CUDA, experimenting with different methods.
I have an RTX 2070, and I tried both the classical shared-memory method from the presentation by Mark Harris and the one described at this link:
where the author (Justin Luitjens) explains how to speed up the reduction using the __shfl_down intrinsic.
I compared the two methods, but to my surprise I got essentially the same results speed-wise:
length | warp intrinsics (GB/s) | shared mem (GB/s)
2^20 | 5.4535 | 5.3383
2^21 | 10.4268 | 10.2254
2^22 | 18.7331 | 18.4934
2^23 | 31.2234 | 31.1320
2^24 | 48.9904 | 48.5354
2^25 | 63.6323 | 64.1023
2^26 | 80.9807 | 80.6243
2^27 | 91.2636 | 91.1508
2^28 | 97.9727 | 97.6832
2^29 | 101.2397 | 101.1321
2^30 | 102.9534 | 97.3306
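For reference, the shuffle-based version I benchmarked follows the structure from Luitjens' post. This is a sketch, not my exact code; note that the un-synchronized `__shfl_down` from the original post is deprecated since CUDA 9, so I use `__shfl_down_sync` with a full-warp mask:

```cuda
#include <cuda_runtime.h>

// Warp-level sum: each step folds the upper half-warp's values onto the
// lower half via register shuffles, no shared memory needed.
__inline__ __device__ float warpReduceSum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's sum
}

// Block-level sum: each warp reduces with shuffles, warp leaders stage
// their partial sums in shared memory, then warp 0 reduces those.
__inline__ __device__ float blockReduceSum(float val) {
    static __shared__ float shared[32];   // one slot per warp
    int lane = threadIdx.x % warpSize;
    int wid  = threadIdx.x / warpSize;

    val = warpReduceSum(val);
    if (lane == 0) shared[wid] = val;
    __syncthreads();

    // First warp reads back the per-warp sums; pad missing warps with 0.
    val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0.0f;
    if (wid == 0) val = warpReduceSum(val);
    return val;
}

// Grid-stride kernel: each thread accumulates several elements, the block
// reduces them, and block sums are combined with a single atomicAdd.
__global__ void reduceKernel(const float* in, float* out, int n) {
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];
    sum = blockReduceSum(sum);
    if (threadIdx.x == 0) atomicAdd(out, sum);
}
```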
The post is somewhat old, so maybe the two methods perform equivalently on modern architectures. Is that so?
Also, the bandwidth I achieve seems to plateau at roughly 100 GB/s, while in his post Justin reports about 140 GB/s on a K40, whose nominal bandwidth (288 GB/s) is only about two-thirds of the RTX 2070's (448 GB/s).
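In case the discrepancy comes from how I measure, this is roughly how I compute the effective bandwidth: a reduction reads each input element once, so for n floats the traffic is n * sizeof(float) bytes. A sketch using CUDA event timing (kernel name and launch parameters here are placeholders, not my actual driver code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Effective bandwidth = bytes moved / elapsed time.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
reduceKernel<<<blocks, threads>>>(d_in, d_out, n);  // placeholder launch
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);             // milliseconds
double gbps = (double)n * sizeof(float) / (ms * 1e6);  // bytes / ns = GB/s
printf("%.2f GB/s\n", gbps);
```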
I am not sure how to post the code. You can find it here:
Together with the driver, here:
I also tried thrust::reduce, but it gave me similar (slightly worse) results.
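The thrust baseline was nothing more elaborate than a direct call (a minimal sketch; the defaults are binary plus with a zero initial value):

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Sum n floats already resident on the device.
thrust::device_vector<float> d_vec(n, 1.0f);
float sum = thrust::reduce(d_vec.begin(), d_vec.end());
```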