Speedy general reduction code (83.5% of peak), works for any size

I extended the code I posted earlier on these forums so that good performance is achieved for arbitrary array sizes as well.

I managed to achieve up to 83.5% of peak on my GTS250, maybe other cards can do better.

GTS250 @ 70.6 GB/s

 N           [GB/s]   [% peak]    [usec]   test
 1048576      50.25    71.17 %      83.5   Pass
 2097152      53.43    75.68 %     157.0   Pass
 4194304      54.96    77.85 %     305.2   Pass
 8388608      57.79    81.86 %     580.6   Pass
 16777216     58.55    82.93 %    1146.2   Pass
 33554432     58.96    83.51 %    2276.4   Fail (CPU malloc issue?)

Non-base 2 tests!

 N           [GB/s]   [% peak]    [usec]   test
 14680102     57.71    81.74 %    1017.5   Pass
 14680119     57.74    81.79 %    1017.0   Pass
 18875600     58.02    82.19 %    1301.2   Pass
 7434886      51.17    72.48 %     581.2   Pass
 1501294      43.93    62.22 %     136.7   Pass
 15052598     50.68    71.79 %    1188.0   Pass
 3135229      50.58    71.65 %     247.9   Pass
 8422202      54.93    77.81 %     613.3   Pass

So below is the code, which includes the test runs. Notice the failure at N = 33554432; this seems to be due to the CPU not giving me the right result… Anyone know why? :confused:

Anyway, this is the fastest general reduction code I’ve seen on GPUs; correct me if I’m wrong here!

[attachment=23305:my_reduction.cu]

compiled with: nvcc my_reduction.cu --ptxas-options="-v" -arch=sm_11 -maxrregcount 40 -use_fast_math

I’m not sure about this, but the blockSize parameter might need tweaking for newer, faster cards than my old GTS250.

EDIT: Updated file…
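For those who don’t want to open the attachment, the overall shape is a grid-strided partial sum followed by a shared-memory tree reduction. Here is a minimal hypothetical sketch (my own simplified names and structure, not the attached file’s actual code):

```cuda
// Stage 1 of a two-stage sum reduction (simplified sketch, not the attached code).
// Each thread accumulates a grid-strided partial sum, then the block reduces
// its partials in shared memory and writes one value per block.
__global__ void reduce_partial(const float *in, float *blockSums, int n)
{
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    float sum = 0.0f;

    // Grid-stride loop handles arbitrary (non-power-of-2) n.
    for (unsigned i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed a power of 2).
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];
}
```

Launched as e.g. reduce_partial<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_sums, n), with a second small pass (or a host-side loop) summing the per-block results.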

Please publish also your QR code ;-)

Hello,

I’ve not looked carefully at the code, but there are lots of “mul24” calls. It seems this is a (potentially big) performance loss on Fermi cards!
Thank you for sharing this
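To illustrate what I mean (a hypothetical snippet, not taken from the attached code): __mul24 was a win on sm_1x, where the 24-bit integer multiply was native and faster than the 32-bit one, but on sm_20 it is emulated with multiple instructions, so the plain multiply is preferable there:

```cuda
// Typical index computation, two ways.
// __mul24 was faster on sm_1x (native 24-bit multiply); on sm_20 (Fermi)
// it is emulated, so the plain 32-bit multiply is the better choice.
__device__ unsigned idx_mul24(void)
{
    return __mul24(blockIdx.x, blockDim.x) + threadIdx.x;  // good on sm_1x
}

__device__ unsigned idx_plain(void)
{
    return blockIdx.x * blockDim.x + threadIdx.x;          // good on sm_20+
}
```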

Doesn’t quite do so well on my GTX470:

GTX470 @ 133.90 GB/s

 N           [GB/s]   [% peak]    [usec]   test
 1048576      77.12    57.59 %      54.4   Pass
 2097152      90.13    67.31 %      93.1   Pass
 4194304      91.90    68.64 %     182.6   Pass
 8388608     100.14    74.79 %     335.1   Pass
 16777216    105.78    79.00 %     634.4   Pass
 33554432    106.15    79.28 %    1264.4   Fail

Non-base 2 tests! 

 N           [GB/s]   [% peak]    [usec]   test
 14680102    110.63    82.62 %     530.8   Pass
 14680119    110.66    82.64 %     530.7   Pass
 18875600     97.65    72.93 %     773.2   Pass
 7434886      79.38    59.28 %     374.6   Pass
 1501294      56.49    42.19 %     106.3   Pass
 15052598     78.84    58.88 %     763.7   Pass
 3135229      78.75    58.81 %     159.2   Pass
 8422203      96.67    72.20 %     348.5   Pass


Hi, thanks for looking into that. The “mul24” calls in the first “reduce” kernel are actually totally unnecessary on any architecture, since the for loop is completely unrolled. I think it should be fine, but if you have the energy perhaps you could change it and have a try.

The “reduce_dynamic” kernel does, however, perform better because of this on older architectures, but its relative runtime should be very small compared to the “reduce” kernel and shouldn’t give a major performance penalty.


It’s good that you’ve gotten similar peak results, but it’s a bit disturbing that they vary so much for arrays of similar size.

I have a suspicion that findBlockSize isn’t always doing its intended job (it was a quick hack…). Could you try manually setting whichSize = 5, or even better, experiment with bigger block sizes (65536*2)?

EDIT: What is your memory interface? Just realized this might really be prone to partition camping… NVIDIA says there is no partition camping on Fermi, but the good Mr. Volkov recently showed here that it is still present…


The QR code is being used commercially… What I just posted is something I did in my spare time because I simply enjoy it :-)


Here’s my warmed up GTX275:

GTX275 @ 140 GB/s

 N           [GB/s]   [% peak]    [usec]   test
 1048576      54.1      38.6        77.6   Pass
 2097152      88.4      63.2        94.9   Pass
 4194304     105.4      75.3       159.2   Pass
 8388608     115.3      82.4       291.0   Pass
 16777216    111.1      79.3       604.3   Pass
 33554432    113.2      80.9      1185.4   Pass

Non-base 2 tests!

 N           [GB/s]   [% peak]    [usec]   test
 14680102    128.3      91.6       457.8   Pass
 14680119    128.4      91.7       457.2   Pass
 18875600    120.6      86.2       625.9   Pass
 7434886      99.2      70.9       299.7   Pass
 9386455      98.7      70.5       380.3   Pass
 16495925     87.1      62.2       757.8   Pass
 4280953     102.4      73.2       167.2   Pass
 8247688      90.8      64.8       363.4   Pass

(I had to change the timing back to cutil on my Windows box, being lazy.)


Here is an update:

  • Fixed mul24 for Fermi (not tested)

  • Doubled occupancy (should also help Fermi a lot)

It now achieves 84% on my card and I seem to be getting better overall results.

GTS250 @ 70.6 GB/s

 N           [GB/s]   [% peak]    [usec]   test
 1048576      50.48    71.50 %      83.1   Pass 0
 2097152      53.94    76.40 %     155.5   Pass 0
 4194304      55.81    79.04 %     300.6   Pass 0
 8388608      57.79    81.86 %     580.6   Pass 0
 16777216     58.71    83.16 %    1143.1   Pass 0
 33554432     59.44    84.19 %    2258.1   Fail 0

Non-base 2 tests!

 N           [GB/s]   [% peak]    [usec]   test
 14680102     59.05    83.64 %     994.5   Pass 38
 14680119     59.04    83.62 %     994.6   Pass 55
 18875600     58.51    82.87 %    1290.5   Pass 1232
 7434886      57.16    80.96 %     520.3   Pass 646
 1501294      52.02    73.68 %     115.4   Pass 110
 15052598     57.96    82.09 %    1038.9   Pass 1846
 3135229      53.08    75.19 %     236.2   Pass 1789
 8422202      57.25    81.09 %     588.5   Pass 826

[attachment=17986:my_reduction.cu]

cheers!


You can get high performance more easily if you load multiple floats at once. Something like:

float a0 = in[0*threads];
float a1 = in[1*threads];
float a2 = in[2*threads];
float a3 = in[3*threads];
sum += a0+a1+a2+a3;

In fact, a simple sum += in[0*threads]+in[1*threads]+in[2*threads]+in[3*threads] should have the same effect.

In this case you don’t need high occupancy. Say, in memcopy I get 87% of pin bandwidth on a GTX480 at only 17% occupancy if I do 16 loads at once. And if I fetch data as float4, I can get 87% of pin bandwidth at only 8% occupancy!

Don’t get bogged down by occupancy considerations. Occupancy is overrated.

Vasily
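Put into a kernel, the suggestion above might look like this (a sketch under the assumption that n4 is a multiple of 4*threads; a real version would pad the input or handle the tail with a cleanup pass):

```cuda
// Sketch of batched loads in the partial-sum stage: each thread issues four
// independent loads per iteration, keeping the memory pipeline full even at
// low occupancy. Assumes n4 is a multiple of 4 * (total threads in the grid).
__global__ void reduce_partial4(const float *in, float *blockSums, int n4)
{
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned threads = gridDim.x * blockDim.x;   // total threads in the grid
    float sum = 0.0f;

    for (unsigned i = blockIdx.x * blockDim.x + tid; i < n4; i += 4 * threads) {
        float a0 = in[i + 0 * threads];
        float a1 = in[i + 1 * threads];
        float a2 = in[i + 2 * threads];
        float a3 = in[i + 3 * threads];          // four loads in flight at once
        sum += a0 + a1 + a2 + a3;
    }

    sdata[tid] = sum;
    __syncthreads();

    // Shared-memory tree reduction (blockDim.x assumed a power of 2).
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];
}
```

The strided addresses (i, i+threads, …) keep each warp’s individual loads coalesced while giving the scheduler four outstanding memory transactions per thread.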


Yeah, that version is a bit better on Fermi:

GTX470 @ 133.9 GB/s

 N           [GB/s]   [% peak]    [usec]   test
 1048576      92.77    69.29 %      45.2   Pass 0
 2097152     101.64    75.91 %      82.5   Pass 0
 4194304     106.68    79.67 %     157.3   Pass 0
 8388608     115.34    86.14 %     290.9   Pass 0
 16777216    121.06    90.41 %     554.3   Pass 0
 33554432    121.84    90.99 %    1101.6   Fail 0

Non-base 2 tests! 

 N           [GB/s]   [% peak]    [usec]   test
 14680102    121.44    90.70 %     483.5   Pass 38
 14680119    121.46    90.71 %     483.4   Pass 55
 18875600    120.28    89.83 %     627.7   Pass 1232
 7434886     112.23    83.82 %     265.0   Pass 646
 1501294      95.46    71.29 %      62.9   Pass 110
 15052598    114.51    85.52 %     525.8   Pass 1846
 3135229      93.84    70.08 %     133.6   Pass 1789
 8422203     112.70    84.17 %     298.9   Pass 827

About the same on my GTX275 though.