Strange behaviour for CUDA-reductions

giuseros85 · January 9, 2019, 10:50pm

Hello everyone,
I am trying to implement a fast reduction in CUDA, experimenting with different methods.

I have an RTX 2070 and tried both the classical shared-memory method from the presentation by Mark Harris and the one at this link:

where the author (Justin Luitjens) explains how to speed up the reduction with the __shfl_down intrinsic.
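
For readers unfamiliar with the shuffle approach, its data movement can be modelled on the host. The sketch below is plain C++, not device code, and is not taken from either post. It emulates one 32-lane warp: at each step every lane adds the value held by the lane `offset` positions above it, and `offset` halves from 16 down to 1. The names `kWarpSize`, `Warp`, `shfl_down_model` and `warp_reduce_sum_model` are hypothetical, chosen only for this illustration.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;      // assumed warp width
using Warp = std::array<float, kWarpSize>; // one value per lane

// Host-side model of a shuffle-down: lane i reads the value of lane
// i + offset. Lanes whose source would fall outside the warp keep
// their own value.
Warp shfl_down_model(const Warp& v, std::size_t offset) {
    Warp out = v;
    for (std::size_t lane = 0; lane + offset < kWarpSize; ++lane) {
        out[lane] = v[lane + offset];
    }
    return out;
}

// Tree reduction over one warp: offsets 16, 8, 4, 2, 1.
float warp_reduce_sum_model(Warp v) {
    for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
        const Warp shifted = shfl_down_model(v, offset);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            v[lane] += shifted[lane]; // all lanes update in lockstep
        }
    }
    return v[0]; // only lane 0 is guaranteed to hold the full sum
}
```

On the device the inner step would be a single intrinsic call per lane instead of an array copy. Since CUDA 9 the synchronising variant __shfl_down_sync is the supported form, and the older __shfl_down is deprecated, which matters for Turing cards such as the RTX 2070.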

I tried to compare the two methods, but to my surprise, I basically got the same results (speed-wise).

length (elements) | warp intrinsics (GB/s) | shared memory (GB/s)
2^20 | 5.4535 | 5.3383
2^21 | 10.4268 | 10.2254
2^22 | 18.7331 | 18.4934
2^23 | 31.2234 | 31.1320
2^24 | 48.9904 | 48.5354
2^25 | 63.6323 | 64.1023
2^26 | 80.9807 | 80.6243
2^27 | 91.2636 | 91.1508
2^28 | 97.9727 | 97.6832
2^29 | 101.2397 | 101.1321
2^30 | 102.9534 | 97.3306

The post is somewhat old, so maybe the two methods are equivalent on modern architectures. Is that so?

Also, the maximum bandwidth I can achieve seems to be about 100 GB/s, whereas in his post Justin reaches 140 GB/s on a K40, whose nominal bandwidth (288 GB/s) is roughly two thirds of the RTX 2070's (448 GB/s).

I am not sure how to post the code. You can find it here:
https://github.com/giuseros/parallelprimitives/blob/master/reduce.cuh

Together with the driver, here:
https://github.com/giuseros/parallelprimitives/blob/master/test_reduction.cu

Thanks,
Giuseppe

EDIT:
I also tried thrust::reduce, but it gave me similar (slightly worse) results.

saulocpp · January 10, 2019, 2:12pm

What is the run time reported by nvprof for each of these kernels?

Check this Stack Overflow question, it might have the information you want:
"CUDA shuffle instruction reduction slower than shared memory reduction?"

giuseros85 · January 11, 2019, 1:34am

Thanks for the reply.

I actually rebuilt with -O3, and at least for medium sizes I can now see a difference.

nvprof shows exactly the same results, and cuda-memcheck doesn't report any errors.

So I think this is really the best performance achievable.

Thanks,
Giuseppe
