Speedy general reduction code (83.5% of peak), works for any size

I extended the code I posted earlier on these forums so that good performance is achieved for arbitrary array sizes as well.

I managed to achieve up to 83.5% of peak on my GTS250, maybe other cards can do better.

GTS250 @ 70.6 GB/s

 N           [GB/s]   [% peak]    [usec]   test
 1048576      50.25    71.17 %      83.5   Pass
 2097152      53.43    75.68 %     157.0   Pass
 4194304      54.96    77.85 %     305.2   Pass
 8388608      57.79    81.86 %     580.6   Pass
 16777216     58.55    82.93 %    1146.2   Pass
 33554432     58.96    83.51 %    2276.4   Fail (CPU malloc issue?)

Non-base 2 tests!

 N           [GB/s]   [% peak]    [usec]   test
 14680102     57.71    81.74 %    1017.5   Pass
 14680119     57.74    81.79 %    1017.0   Pass
 18875600     58.02    82.19 %    1301.2   Pass
 7434886      51.17    72.48 %     581.2   Pass
 1501294      43.93    62.22 %     136.7   Pass
 15052598     50.68    71.79 %    1188.0   Pass
 3135229      50.58    71.65 %     247.9   Pass
 8422202      54.93    77.81 %     613.3   Pass

So below is the code, which includes the test runs. Notice the failure at N = 33554432; this seems to be due to the CPU not giving me the right result… Anyone know why? :confused:

Anyway, this is the fastest general reduction code I’ve seen on GPUs; correct me if I’m wrong here!

[attachment=23305:my_reduction.cu]

compiled with: nvcc my_reduction.cu --ptxas-options="-v" -arch=sm_11 -maxrregcount 40 -use_fast_math

I’m not sure about this, but the blockSize parameter might need tweaking for newer, faster cards than my old GTS250.

EDIT: Updated file…
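For those who don’t want to open the attachment, the overall shape is a grid-strided partial sum followed by a shared-memory tree reduction. Here is a minimal hypothetical sketch (my own simplified names and structure, not the attached file’s actual code):

```cuda
// Stage 1 of a two-stage sum reduction (simplified sketch, not the attached code).
// Each thread accumulates a grid-strided partial sum, then the block reduces
// its partials in shared memory and writes one value per block.
__global__ void reduce_partial(const float *in, float *blockSums, int n)
{
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    float sum = 0.0f;

    // Grid-stride loop handles arbitrary (non-power-of-2) n.
    for (unsigned i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed a power of 2).
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];
}
```

Launched as e.g. reduce_partial<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_sums, n), with a second small pass (or a host-side loop) summing the per-block results.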

Please publish also your QR code ;-)

Hello,

I’ve not looked carefully at the code, but there are lots of “mul24” calls. It seems this is a (potentially big) performance loss on Fermi cards!
Thank you for sharing this
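To illustrate what I mean (a hypothetical snippet, not taken from the attached code): __mul24 was a win on sm_1x, where the 24-bit integer multiply was native and faster than the 32-bit one, but on sm_20 it is emulated with multiple instructions, so the plain multiply is preferable there:

```cuda
// Typical index computation, two ways.
// __mul24 was faster on sm_1x (native 24-bit multiply); on sm_20 (Fermi)
// it is emulated, so the plain 32-bit multiply is the better choice.
__device__ unsigned idx_mul24(void)
{
    return __mul24(blockIdx.x, blockDim.x) + threadIdx.x;  // good on sm_1x
}

__device__ unsigned idx_plain(void)
{
    return blockIdx.x * blockDim.x + threadIdx.x;          // good on sm_20+
}
```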

Doesn’t quite do so well on my GTX470:

GTX470 @ 133.90 GB/s

 N           [GB/s]   [% peak]    [usec]   test
 1048576      77.12    57.59 %      54.4   Pass
 2097152      90.13    67.31 %      93.1   Pass
 4194304      91.90    68.64 %     182.6   Pass
 8388608     100.14    74.79 %     335.1   Pass
 16777216    105.78    79.00 %     634.4   Pass
 33554432    106.15    79.28 %    1264.4   Fail

Non-base 2 tests! 

 N           [GB/s]   [% peak]    [usec]   test
 14680102    110.63    82.62 %     530.8   Pass
 14680119    110.66    82.64 %     530.7   Pass
 18875600     97.65    72.93 %     773.2   Pass
 7434886      79.38    59.28 %     374.6   Pass
 1501294      56.49    42.19 %     106.3   Pass
 15052598     78.84    58.88 %     763.7   Pass
 3135229      78.75    58.81 %     159.2   Pass
 8422203      96.67    72.20 %     348.5   Pass


Hi, thanks for looking into that. The “mul24” calls in the first “reduce” kernel are actually totally unnecessary on any architecture, since the for loop is completely unrolled. I think it should be fine, but if you have the energy perhaps you could change it and have a try.

The “reduce_dynamic” kernel does, however, perform better because of this on older architectures, but its relative runtime should be very small compared to the “reduce” kernel and shouldn’t give a major performance penalty.


It’s good that you’ve gotten similar peak results, but it’s a bit disturbing that they vary so much for arrays of similar size.

I have a suspicion that findBlockSize isn’t always doing its intended job (it was a quick hack…). Could you try manually setting whichSize = 5, or even better, experiment with bigger block sizes (65536*2)?

EDIT: What is your memory interface? Just realized this might really be prone to partition camping… NVIDIA says there is no partition camping on Fermi, but the good Mr. Volkov recently showed here that it is still present…


The QR code is being used commercially… What I just posted is something I did in my spare time because I simply enjoy it :-)


Here’s my warmed up GTX275:

GTX275 @ 140 GB/s

 N           [GB/s]   [% peak]    [usec]   test
 1048576      54.1      38.6        77.6   Pass
 2097152      88.4      63.2        94.9   Pass
 4194304     105.4      75.3       159.2   Pass
 8388608     115.3      82.4       291.0   Pass
 16777216    111.1      79.3       604.3   Pass
 33554432    113.2      80.9      1185.4   Pass

Non-base 2 tests!

 N           [GB/s]   [% peak]    [usec]   test
 14680102    128.3      91.6       457.8   Pass
 14680119    128.4      91.7       457.2   Pass
 18875600    120.6      86.2       625.9   Pass
 7434886      99.2      70.9       299.7   Pass
 9386455      98.7      70.5       380.3   Pass
 16495925     87.1      62.2       757.8   Pass
 4280953     102.4      73.2       167.2   Pass
 8247688      90.8      64.8       363.4   Pass

(I had to change the timing back to cutil on my Windows box, being lazy.)


Here is an update:

  • Fixed mul24 for Fermi (not tested)

  • Doubled occupancy (should also help Fermi a lot)

It now achieves 84% on my card and I seem to be getting better overall results.

GTS250 @ 70.6 GB/s

 N           [GB/s]   [% peak]    [usec]   test
 1048576      50.48    71.50 %      83.1   Pass 0
 2097152      53.94    76.40 %     155.5   Pass 0
 4194304      55.81    79.04 %     300.6   Pass 0
 8388608      57.79    81.86 %     580.6   Pass 0
 16777216     58.71    83.16 %    1143.1   Pass 0
 33554432     59.44    84.19 %    2258.1   Fail 0

Non-base 2 tests!

 N           [GB/s]   [% peak]    [usec]   test
 14680102     59.05    83.64 %     994.5   Pass 38
 14680119     59.04    83.62 %     994.6   Pass 55
 18875600     58.51    82.87 %    1290.5   Pass 1232
 7434886      57.16    80.96 %     520.3   Pass 646
 1501294      52.02    73.68 %     115.4   Pass 110
 15052598     57.96    82.09 %    1038.9   Pass 1846
 3135229      53.08    75.19 %     236.2   Pass 1789
 8422202      57.25    81.09 %     588.5   Pass 826

[attachment=17986:my_reduction.cu]

cheers!


You can get high performance more easily if you load multiple floats at once. Something like:

float a0 = in[0*threads];
float a1 = in[1*threads];
float a2 = in[2*threads];
float a3 = in[3*threads];
sum += a0+a1+a2+a3;

In fact, a simple sum += in[0*threads]+in[1*threads]+in[2*threads]+in[3*threads] should have the same effect.

In this case you don’t need high occupancy. Say, in memcopy I get 87% of pin bandwidth on a GTX480 at only 17% occupancy if I do 16 loads at once. And if I fetch data as float4, I can get 87% of pin bandwidth at only 8% occupancy!

Don’t get bogged down by occupancy considerations. Occupancy is overrated.

Vasily
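Put into a kernel, the suggestion above might look like this (a sketch under the assumption that n4 is a multiple of 4*threads; a real version would pad the input or handle the tail with a cleanup pass):

```cuda
// Sketch of batched loads in the partial-sum stage: each thread issues four
// independent loads per iteration, keeping the memory pipeline full even at
// low occupancy. Assumes n4 is a multiple of 4 * (total threads in the grid).
__global__ void reduce_partial4(const float *in, float *blockSums, int n4)
{
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned threads = gridDim.x * blockDim.x;   // total threads in the grid
    float sum = 0.0f;

    for (unsigned i = blockIdx.x * blockDim.x + tid; i < n4; i += 4 * threads) {
        float a0 = in[i + 0 * threads];
        float a1 = in[i + 1 * threads];
        float a2 = in[i + 2 * threads];
        float a3 = in[i + 3 * threads];          // four loads in flight at once
        sum += a0 + a1 + a2 + a3;
    }

    sdata[tid] = sum;
    __syncthreads();

    // Shared-memory tree reduction (blockDim.x assumed a power of 2).
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0];
}
```

The strided addresses (i, i+threads, …) keep each warp’s individual loads coalesced while giving the scheduler four outstanding memory transactions per thread.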


Yeah, that version is a bit better on Fermi:

GTX470 @ 133.9 GB/s

 N           [GB/s]   [% peak]    [usec]   test
 1048576      92.77    69.29 %      45.2   Pass 0
 2097152     101.64    75.91 %      82.5   Pass 0
 4194304     106.68    79.67 %     157.3   Pass 0
 8388608     115.34    86.14 %     290.9   Pass 0
 16777216    121.06    90.41 %     554.3   Pass 0
 33554432    121.84    90.99 %    1101.6   Fail 0

Non-base 2 tests! 

 N           [GB/s]   [% peak]    [usec]   test
 14680102    121.44    90.70 %     483.5   Pass 38
 14680119    121.46    90.71 %     483.4   Pass 55
 18875600    120.28    89.83 %     627.7   Pass 1232
 7434886     112.23    83.82 %     265.0   Pass 646
 1501294      95.46    71.29 %      62.9   Pass 110
 15052598    114.51    85.52 %     525.8   Pass 1846
 3135229      93.84    70.08 %     133.6   Pass 1789
 8422203     112.70    84.17 %     298.9   Pass 827

About the same on my GTX275 though.