Speedy general reduction sum code (~88.5% of peak), updated for Kepler with __shfl(), etc.

When I look at the version of this reduction code which does use __shfl:

float warpSum = warp_reduce_registers<x_threads, NbXGroups>( sum );

// Broadcast warp-reduced sums via shared memory
if( threadIdx.x % 32 == 0 )
    smem[threadIdx.y][threadIdx.x / 32] = warpSum;

__syncthreads();

I am curious about the ‘broadcast’ step for the warpSum value. I assume this is the ‘faster’ route to filling in the shared memory values. Unlike the usual warp-reduce step, which is done when threadIdx.x < 32, this only broadcasts when threadIdx.x is a multiple of the warp size. So the per-warp reduction itself is done in registers rather than shared memory?
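For my own understanding, I assume the pattern is roughly the one sketched below (my own code with 1D block indexing, not the actual warp_reduce_registers from the attachment; on CUDA 9 and later the *_sync shuffle variants replace the plain __shfl()/__shfl_down() used in the Kepler-era code):

// Sketch only: per-warp sum kept entirely in registers via shuffles,
// then one value per warp is "broadcast" into shared memory.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__inline__ __device__ float warpReduceSum(float val)
{
    // After 5 shuffle steps, lane 0 holds the sum of all 32 lanes.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);   // __shfl_down() pre-CUDA 9
    return val;
}

__inline__ __device__ float blockReduceSum(float val)
{
    __shared__ float partial[32];          // one slot per warp (max 1024 threads/block)
    int lane   = threadIdx.x % 32;
    int warpId = threadIdx.x / 32;

    val = warpReduceSum(val);              // per-warp reduction, registers only

    if (lane == 0) partial[warpId] = val;  // "broadcast": one shared-memory write per warp
    __syncthreads();

    // The first warp then reduces the per-warp partial sums.
    val = (threadIdx.x < blockDim.x / 32) ? partial[lane] : 0.0f;
    if (warpId == 0) val = warpReduceSum(val);
    return val;                            // thread 0 ends up with the block sum
}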

Also, why is 64 used as the thread block size, when 256 tends to be optimal for Kepler?

Thanks.

This is really great. Happy to see a fast sort. How does it fare on Kepler? Do you think you can beat 32-bit sort too?

Have not tested on Kepler yet. The solution does rely quite a bit on atomic operations, for which Kepler should provide further improvements over Fermi.

I did implement a 32-bit sorter, but by the time of the poster submission it was “only” roughly on par with the B40C sorter (Google Code Archive - Long-term storage for Google Code Project Hosting). Will see if I can contribute anything more there.

I tested this code on a GTX 780ti and thought to post the results:

GeForce GTX 780 Ti @ 336.000 GB/s

N            [GB/s]    [perc]    [usec]    test
1048576      157.33    46.82     26.7      Pass
2097152      192.49    57.29     43.6      Pass
4194304      233.10    69.38     72.0      Pass
8388608      258.17    76.84     130.0     Pass
16777216     273.89    81.51     245.0     Pass
33554432     281.41    83.75     476.9     Pass
67108864     285.67    85.02     939.7     Pass
134217728    287.55    85.58     1867.1

Non-base 2 tests!

N            [GB/s]    [perc]    [usec]    test
14680102     272.84    81.20     215.2     Pass
14680119     272.76    81.18     215.3     Pass
18875600     270.54    80.52     279.1     Pass
7434886      165.25    49.18     180.0     Pass
13324075     247.17    73.56     215.6     Pass
15764213     257.93    76.76     244.5     Pass
1850154      65.80     19.58     112.5     Pass
4991241      148.23    44.12     134.7     Pass

Even with the infamous WDDM driver, still good numbers.

Wait, we can use atomic functions with floats now?

Re: the GTX 780 Ti in Windows → I thought there would be a big performance hit due to using the WDDM driver vs. the TCC driver in the Tesla line.

At first there is a bit more overhead with the WDDM driver, but after that I have seen no issues so far.

Not only that, but somehow my CUDA implementation of next_permutation(), which I have tested on a GTX 780 Ti in Linux, is actually FASTER in Windows 7.

For example, generating all permutations of a 13-element array took 8.19 seconds in Linux (Ubuntu 12.10), while in Windows 7 it took 6.95 seconds:

Testing raw permutation version.
GPU timing for 13!: 6.954 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu

Odd…

You have been able to use atomics on floats since compute capability 2.0. This reduction code does not use atomics at all though; I think you are looking at the wrong file.

Are you kidding me? I’ve been converting my floats to ints and… This whole time… Omg… I even had this whole topic here about how to use floats with atomicMin() and no one said that… Oh well, at least I learned how to radix sort?
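To be fair, as far as I know it is only atomicAdd() and atomicExch() that have native float overloads from compute capability 2.0 onward; atomicMin()/atomicMax() on float still need a bit-pattern trick like the one you describe. A rough sketch (my own naming, not code from this thread):

// Native hardware float atomic, available since compute capability 2.0.
__global__ void sumKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);   // no int conversion needed
}

// atomicMin() has no float overload, so reinterpret the bits: non-negative
// floats order like signed ints, negative floats order in reverse as
// unsigned ints (NaNs not handled).
__device__ float atomicMinFloat(float* addr, float value)
{
    return (value >= 0.0f)
        ? __int_as_float(atomicMin((int*)addr, __float_as_int(value)))
        : __uint_as_float(atomicMax((unsigned int*)addr, __float_as_uint(value)));
}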

I am having trouble accessing the attachments in this thread.

I guess that means that the original reduction code by JP is no longer correctly linked? If you are looking for that code you can PM me or I can post it, but I am not sure if JP took it down himself.

I, too, am interested in the source code.

Please post it or PM me how to get it. Tnx.

This was the OP’s original code.
IMO the best gauge of bandwidth.

http://pastebin.com/2pVhzeYD

Thanks for re-posting.

The GTX 980 numbers:

GeForce GTX 980 @ 224.320 GB/s

N            [GB/s]    [perc]    [usec]    test
1048576      125.25    55.84     33.5      Pass
2097152      145.45    64.84     57.7      Pass
4194304      159.49    71.10     105.2     Pass
8388608      168.04    74.91     199.7     Pass
16777216     172.79    77.03     388.4     Pass
33554432     175.20    78.10     766.1     Pass
67108864     176.43    78.65     1521.5    Pass
134217728    177.06    78.93     3032.1    Pass

Non-base 2 tests!

N            [GB/s]    [perc]    [usec]    test
14680102     172.29    76.80     340.8     Pass
14680119     172.36    76.84     340.7     Pass
18875600     171.25    76.34     440.9     Pass
7434886      123.42    55.02     241.0     Pass
13324075     161.86    72.16     329.3     Pass
15764213     166.43    74.19     378.9     Pass
1850154      58.77     26.20     125.9     Pass
4991241      114.61    51.09     174.2     Pass

The GTX 780 Ti is still the king in GB/s, both in nominal terms and as a percentage of theoretical maximum (it was about 86% of theoretical on the same test, though maybe some adjustments are needed for Maxwell).

Yeah, since they changed the forums around, the original file was removed.

Looks like it needs a little tweaking for Maxwell perhaps? :-) Will give it a spin on my GTX980 when I have the time to fiddle with it.

Thanks CudaaduC for the helpful re-post of the source.

GTX 980m

GeForce GTX 980M @ 160.320 GB/s

N            [GB/s]    [perc]    [usec]    test
1048576      96.79     60.37     43.3      Pass
2097152      112.60    70.24     74.5      Pass
4194304      128.75    80.31     130.3     Pass
8388608      136.75    85.30     245.4     Pass
16777216     141.31    88.14     474.9     Pass
33554432     143.84    89.72     933.1     Pass
67108864     144.86    90.36     1853.1
134217728    145.84    90.97     3681.3

Non-base 2 tests!

N            [GB/s]    [perc]    [usec]    test
14680102     140.85    87.85     416.9     Pass
14680119     140.79    87.82     417.1     Pass
18875600     140.11    87.39     538.9     Pass
7434886      99.97     62.35     297.5     Pass
13324075     132.60    82.71     401.9     Pass
15764213     136.11    84.90     463.3     Pass
1850154      47.10     29.38     157.1     Pass
4991241      93.11     58.08     214.4     Pass

In terms of percentage of theoretical maximum, that is much better than the desktop GTX 980.

In nominal terms it is not as good, obviously, but overall the GTX 980M is very impressive for a laptop.

90.97% :-)

Would be curious to know the difference between the chips. How wide is the memory bus?

256-bit, I believe, for the 980M.
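For what it is worth, the “@ ... GB/s” headers above are consistent with the usual theoretical-bandwidth arithmetic from the device properties (2 transfers per memory clock for GDDR5 times the bus width). A small host-side check I use, not code from the benchmark itself:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate is in kHz, memoryBusWidth in bits; GDDR5 transfers twice per clock.
    double peakGBs = 2.0 * prop.memoryClockRate * 1000.0 * (prop.memoryBusWidth / 8.0) / 1.0e9;

    // e.g. GTX 780 Ti: 384-bit @ 3500 MHz -> 336.000 GB/s, matching the header above.
    printf("%s @ %.3f GB/s (%d-bit bus)\n", prop.name, peakGBs, prop.memoryBusWidth);
    return 0;
}

By that arithmetic, the 980M header of 160.32 GB/s corresponds to a 256-bit bus at roughly a 2.5 GHz memory clock (about 5 GT/s effective).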

Using the CUDA-Z utility I get 3.4 TFLOPS for 32-bit float and 1.035 TeraIops for 32-bit integers (whatever those values are worth in real-life terms).

The specifications for the GTX 980M state 1536 CUDA cores at a 1.038 GHz base clock:

[url]http://www.geforce.com/hardware/notebook-gpus/geforce-gtx-980m/specifications[/url]

This would result in 3.19 single-precision TFLOPS, but the GPU may apply a clock boost during the benchmark run (are there vendor-overclocked GPUs in the mobile field? Don’t know). I am not sure how CUDA-Z measures TeraIops. 32-bit IMAD throughput, with each IMAD counting as two operations? 32-bit IADD throughput?
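For reference, the arithmetic behind that 3.19 TFLOPS figure, using the spec numbers quoted above (one FMA per core per clock, counted as two floating-point operations):

#include <stdio.h>

int main(void)
{
    const int    cuda_cores    = 1536;   /* GTX 980M, per the spec page above */
    const double base_ghz      = 1.038;  /* base clock in GHz */
    const double flops_per_fma = 2.0;    /* multiply + add */

    /* 1536 * 1.038 GHz * 2 ~= 3.19 TFLOPS at the base clock */
    printf("Peak SP: %.2f TFLOPS\n", cuda_cores * base_ghz * flops_per_fma / 1000.0);
    return 0;
}

By the same arithmetic, the 3.4 TFLOPS reported by CUDA-Z would correspond to a clock of roughly 1.1 GHz, i.e. some boost above the base clock.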