Speedy general reduction code (83.5% of peak) - works for any size

Yeah that version is a bit better on Fermi:

GTX470 @ 133.9 GB/s

 N           [GB/s]   [% of peak]   [usec]    test
 1048576      92.77      69.29 %      45.2    Pass 0
 2097152     101.64      75.91 %      82.5    Pass 0
 4194304     106.68      79.67 %     157.3    Pass 0
 8388608     115.34      86.14 %     290.9    Pass 0
 16777216    121.06      90.41 %     554.3    Pass 0
 33554432    121.84      90.99 %    1101.6    Fail 0

Non-base 2 tests!

 N           [GB/s]   [% of peak]   [usec]    test
 14680102    121.44      90.70 %     483.5    Pass 38
 14680119    121.46      90.71 %     483.4    Pass 55
 18875600    120.28      89.83 %     627.7    Pass 1232
 7434886     112.23      83.82 %     265.0    Pass 646
 1501294      95.46      71.29 %      62.9    Pass 110
 15052598    114.51      85.52 %     525.8    Pass 1846
 3135229      93.84      70.08 %     133.6    Pass 1789
 8422203     112.70      84.17 %     298.9    Pass 827

About the same on my GTX275 though.

Yes, my experience has also been that occupancy is very overrated (some of your work inspired this…). :)

I’m not entirely sure I understand the difference between what you are suggesting and what is already happening in the code:

// in the reduce kernel
const int iters = els_per_block/threads;

#pragma unroll
for(int i = 0; i < iters; i++)
{
    sum += in[tid + i*threads];
}

=>

sum += in[tid + 0];
sum += in[tid + 1*threads];
sum += in[tid + 2*threads];
sum += in[tid + 3*threads];
...

Doesn’t that yield the same result? I will play with the code again later today; maybe I will figure it out :)

Thanks for the tips! There is probably a lot more left to optimize here!
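
For context, here is a generic sketch (not the actual kernel from this thread; THREADS, ELS_PER_BLOCK and reduce_sketch are illustrative names) of where such a compile-time-unrolled load phase typically sits inside a block-wise reduction:

// Illustrative sketch only, not the code under discussion.
// Assumes blockDim.x == THREADS and each block owns ELS_PER_BLOCK consecutive elements.
#define THREADS        256
#define ELS_PER_BLOCK  (THREADS * 16)

__global__ void reduce_sketch(const float *in, float *block_sums)
{
    __shared__ float smem[THREADS];

    const int    tid = threadIdx.x;
    const float *p   = in + blockIdx.x * ELS_PER_BLOCK;

    // Load phase: the trip count is known at compile time, so #pragma unroll
    // expands this loop into straight-line code, exactly as in the expansion above.
    float sum = 0.0f;
    #pragma unroll
    for(int i = 0; i < ELS_PER_BLOCK/THREADS; i++)
        sum += p[tid + i*THREADS];

    // Shared-memory tree reduction of the per-thread partial sums.
    smem[tid] = sum;
    __syncthreads();
    for(int s = THREADS/2; s > 0; s >>= 1)
    {
        if(tid < s)
            smem[tid] += smem[tid + s];
        __syncthreads();
    }

    if(tid == 0)
        block_sums[blockIdx.x] = smem[0];
}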

oh, you are right. For some reason I was looking at the reduce_dynamic kernel %-)

I am surprised, then, that you do a fixed amount of work in the reduce kernel and launch a problem-dependent number of blocks. I’d do it the other way around: launch a fixed number of blocks that run for a problem-dependent number of iterations. This is the common way to split work on the CPU - you simply divide the total work between, say, 4 or 8 threads when running on a quad-core. The advantage is that most of the summation is done sequentially. I guess the problem with this approach is that you’d have to tune the optimal number of blocks for every GPU individually. Also, I doubt it would run faster than your current code :)
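
Schematically (just a rough sketch of the idea, not code from this thread; the block size and grid size are illustrative per-GPU tuning constants):

// Sketch of the fixed-block-count approach: launch a fixed grid and let each
// thread stride over the array, so the iteration count depends on n while the
// number of blocks stays constant.
__global__ void reduce_fixed_blocks(const float *in, float *block_sums, int n)
{
    __shared__ float smem[256];               // assumes blockDim.x == 256

    // Grid-stride loop: most of the summation happens sequentially per thread.
    float sum = 0.0f;
    for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
        i += gridDim.x * blockDim.x)
        sum += in[i];

    // Per-block shared-memory reduction, one partial sum written per block.
    smem[threadIdx.x] = sum;
    __syncthreads();
    for(int s = blockDim.x/2; s > 0; s >>= 1)
    {
        if(threadIdx.x < s)
            smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }
    if(threadIdx.x == 0)
        block_sums[blockIdx.x] = smem[0];
}

// Launched e.g. as: reduce_fixed_blocks<<<NUM_BLOCKS, 256>>>(d_in, d_partial, n);
// where NUM_BLOCKS would be tuned once per GPU rather than derived from n.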

Vasily

Yes, this is the “trick” I use in this implementation to be able to unroll those for loops (i.e. I want to know the number of iterations at compile time). The final reduce_dynamic kernel just takes care of the scraps, which I called the “tail” in the code.

The “tail” is very small in relation to the whole problem, which means that the runtime of the inefficient reduce_dynamic kernel becomes insignificant (there might be room for tweaking here though).
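
On the host side, that split could look roughly like this (a sketch under assumptions: reduce_fixed is a placeholder name and the signatures are made up; only reduce_dynamic, els_per_block and threads come from the code being discussed, and the numeric values are illustrative):

// Placeholder declarations, standing in for the kernels discussed in the thread.
__global__ void reduce_fixed(const float *in, float *block_sums);
__global__ void reduce_dynamic(const float *in, float *block_sums, int n);

// Host-side split: the bulk of the array goes to the kernel with a
// compile-time element count per block; the leftover "tail" goes to the
// generic reduce_dynamic kernel.
void reduce(const float *d_in, float *d_block_sums, int n)
{
    const int threads       = 256;                 // block size (assumed)
    const int els_per_block = 16 * threads;        // compile-time constant (assumed value)
    const int num_blocks    = n / els_per_block;   // full blocks, problem dependent
    const int body          = num_blocks * els_per_block;
    const int tail          = n - body;            // scraps, < els_per_block elements

    if(num_blocks > 0)
        reduce_fixed<<<num_blocks, threads>>>(d_in, d_block_sums);
    if(tail > 0)
        reduce_dynamic<<<1, threads>>>(d_in + body, d_block_sums + num_blocks, tail);

    // ...followed by a final reduction of the per-block partial sums.
}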

You can unroll loops in a somewhat more classical way, say:

int i = 0;
// unrolled main part, in chunks of 256 iterations
for(; i < n-256; i += 256)
{
    #pragma unroll
    for(int j = 0; j < 256; j++)
        sum += in[tid + (i+j)*threads];
}
// remainder loop, not unrolled
for(; i < n; i++)
    sum += in[tid + i*threads];

[quote name='Jimmy Pettersson' post='1105489' date='Aug 18 2010, 10:48 AM']

Here is an update:

  • Fix mul24 for Fermi (not tested)

  • Doubled occupancy (should also help Fermi a lot)

It now achieves 84% on my card and I seem to be getting better overall results.

GTX275 @ 140? GB/s

 N           [GB/s]   [% of peak]   [usec]    test
 1048576      65.8       47.0         63.7    Pass 0
 2097152      99.6       71.2         84.2    Pass 0
 4194304     110.9       79.2        151.3    Pass 0
 8388608     122.4       87.4        274.1    Pass 0
 16777216    128.1       91.5        523.9    Pass 0
 33554432    130.5       93.2       1028.8    Pass 0

Non-base 2 tests!

 N           [GB/s]   [% of peak]   [usec]    test
 14680102    127.7       91.2        459.7    Pass 38
 14680119    127.8       91.3        459.3    Pass 55
 18875600    128.6       91.8        587.2    Pass 1232
 7434886     119.0       85.0        249.8    Pass 646
 9386455     117.5       83.9        319.5    Pass 471
 16495925    110.5       78.9        597.2    Pass 1333
 4280953     108.5       77.5        157.8    Pass 633
 8247688     115.8       82.7        284.8    Pass 392

[/quote]

Isn’t that what you get with #pragma unroll 256 anyway? What is gained by treating the remainder explicitly?
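
In code, that suggestion would amount to putting the pragma straight on the dynamic loop and letting the compiler deal with the leftover iterations itself (sketch only, reusing the variable names from the snippets above):

// Partial unroll by the compiler: n is a runtime value here, and
// #pragma unroll 256 asks nvcc to unroll the loop by a factor of 256.
#pragma unroll 256
for(int i = 0; i < n; i++)
    sum += in[tid + i*threads];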

I’m so retarded, who calls the remainder “tail”??? idjit! :)

Anyway, if you treat the remainder explicitly you can, for example, have predefined block sizes (how many elements each block processes), which may be computed much faster. If the remainder is small enough, its compute time will be relatively small even if it is handled less efficiently.

I find that quite descriptive: the tail is what remains after chopping off the body… :)

:)

Yeah, some signal processing guys I’ve been talking to use that term a lot as well.

If this works it should be slightly more efficient, and I guess one could fix #blocks to some optimal number as you suggested. I must admit I wasn’t aware that one could unroll that :)

Will give it a spin later!

The only objection that comes to mind is that, if one is unlucky, the remainder will be something like 255 elements * #blocks, which is larger than the original remainder?

Thanks!