Strange behaviour for CUDA reductions

Hello everyone,
I am trying to implement a fast reduction in CUDA, experimenting with different methods.

I have an RTX 2070 and I tried both the classical shared-memory method from the presentation by Mark Harris and the one at this link:

where the author (Justin Luitjens) explains how to speed up the reduction with the __shfl_down intrinsic.
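
For readers without the post at hand, a shuffle-based reduction looks roughly like the sketch below. It is not the code from the repository linked further down, nor a copy of Justin's code: the names warp_reduce_sum, reduce_sum_kernel and reduce_sum are made up for illustration, and the launch configuration (256 threads, at most 1024 blocks) is an arbitrary assumption. It uses __shfl_down_sync because the legacy __shfl_down has been deprecated since CUDA 9 and is not available when targeting compute capability 7.x devices such as the RTX 2070. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Sum the values held by the 32 lanes of a warp; lane 0 ends up with the total.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// Each thread accumulates a grid-stride partial sum, each warp reduces its
// partial sums with shuffles, and lane 0 of every warp adds the result to *out.
// Assumes blockDim.x is a multiple of 32 and *out is zero before the launch.
__global__ void reduce_sum_kernel(const float* in, float* out, size_t n) {
    float sum = 0.0f;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
         i += (size_t)blockDim.x * gridDim.x)
        sum += in[i];
    sum = warp_reduce_sum(sum);
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, sum);
}

// Hypothetical host wrapper: copy the input to the device, reduce, copy back.
float reduce_sum(const std::vector<float>& h_in) {
    const size_t n = h_in.size();
    if (n == 0) return 0.0f;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));
    const int threads = 256;  // assumed block size
    const int blocks = (int)std::min<size_t>((n + threads - 1) / threads, 1024);
    reduce_sum_kernel<<<blocks, threads>>>(d_in, d_out, n);
    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Because the per-warp results are combined with a floating-point atomicAdd, the summation order (and therefore the rounding) can vary from run to run.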

I tried to compare the two methods, but to my surprise, I basically got the same results (speed-wise).

length | warp intrinsics (GB/s) | shared_mem (GB/s)
2^20   |                 5.4535 |            5.3383
2^21   |                10.4268 |           10.2254
2^22   |                18.7331 |           18.4934
2^23   |                31.2234 |           31.1320
2^24   |                48.9904 |           48.5354
2^25   |                63.6323 |           64.1023
2^26   |                80.9807 |           80.6243
2^27   |                91.2636 |           91.1508
2^28   |                97.9727 |           97.6832
2^29   |               101.2397 |          101.1321
2^30   |               102.9534 |           97.3306

The post is somewhat old, so maybe the two methods are equivalent on modern architectures. Is that so?

Also, the maximum bandwidth I can reach seems to be about 100 GB/s, while in his post Justin seems to achieve 140 GB/s on a K40, whose nominal bandwidth is almost half that of the RTX 2070.
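
For context, figures like these are normally computed as bytes read divided by kernel time. The sketch below shows one common way to do that with CUDA events; it is an assumption about methodology, not a description of how the linked driver measures, and the helper names effective_bandwidth_gbs and time_kernel_ms are invented for illustration. The way the reported GB/s climbs with input size in the table is what one would expect if a roughly constant per-call cost were included in the timed region, but that is an inference from the numbers, not something checked against the code.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Effective bandwidth in GB/s (10^9 bytes per second) for a kernel that
// reads `bytes` bytes in `ms` milliseconds.
inline double effective_bandwidth_gbs(size_t bytes, double ms) {
    return (double)bytes / (ms * 1.0e-3) / 1.0e9;
}

// Average time of one call to `launch`, in milliseconds, measured with CUDA
// events. One untimed warm-up call keeps one-off start-up costs out of the
// measurement.
template <typename Launch>
double time_kernel_ms(Launch launch, int reps = 20) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    launch();                    // warm-up, not timed
    cudaDeviceSynchronize();
    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}
```

A typical use would be to pass a lambda that launches the reduction kernel to time_kernel_ms and then call effective_bandwidth_gbs(n * sizeof(float), ms) with the result.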

I am not sure how to post the code. You can find it here:
https://github.com/giuseros/parallelprimitives/blob/master/reduce.cuh

Together with the driver, here:
https://github.com/giuseros/parallelprimitives/blob/master/test_reduction.cu

Thanks,
Giuseppe

EDIT:
I also tried thrust::reduce, but it gave me similar (slightly worse) results.
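
For reference, a Thrust sum over device data is essentially a one-liner. This is a minimal sketch, not necessarily the call used in the linked driver; thrust_sum is a name chosen for illustration.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

// Sum all elements of a device vector, starting from 0.0f.
float thrust_sum(const thrust::device_vector<float>& d_in) {
    return thrust::reduce(d_in.begin(), d_in.end(), 0.0f, thrust::plus<float>());
}
```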

What is the run time reported by nvprof for each of these kernels?

Check this, it might have the information you want:
"CUDA shuffle instruction reduction slower than shared memory reduction?" (Stack Overflow)

Thanks for the reply.

I actually rebuilt with -O3, and at least for medium sizes I can now see a difference.

nvprof is showing the exact same results, and cuda-memcheck doesn’t show any errors.

So I think this is really the best performance achievable.

Thanks,
Giuseppe