Speedy general reduction sum code (~88.5% of peak), updated for Kepler with __shfl(), etc.

When I look at the version of this reduction code which does use __shfl:

float warpSum = warp_reduce_registers<x_threads, NbXGroups>( sum );

// Broadcast warp-reduced sums via shared memory
if( threadIdx.x % 32 == 0 )
    smem[threadIdx.y][threadIdx.x / 32] = warpSum;

__syncthreads();

I am curious about the ‘broadcast’ step for the warpSum value. I assume this is the ‘faster’ route to filling in the shared memory values. Unlike the usual warp-reduce step, which is done when threadIdx.x < 32, this only broadcasts when threadIdx.x is a multiple of the warp size. So the per-warp reduction itself is done in registers rather than shared memory?
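For my own understanding, I assume the pattern is roughly the one sketched below (my own code with 1D block indexing, not the actual warp_reduce_registers from the attachment; on CUDA 9 and later the *_sync shuffle variants replace the plain __shfl()/__shfl_down() used in the Kepler-era code):

// Sketch only: per-warp sum kept entirely in registers via shuffles,
// then one value per warp is "broadcast" into shared memory.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__inline__ __device__ float warpReduceSum(float val)
{
    // After 5 shuffle steps, lane 0 holds the sum of all 32 lanes.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);   // __shfl_down() pre-CUDA 9
    return val;
}

__inline__ __device__ float blockReduceSum(float val)
{
    __shared__ float partial[32];          // one slot per warp (max 1024 threads/block)
    int lane   = threadIdx.x % 32;
    int warpId = threadIdx.x / 32;

    val = warpReduceSum(val);              // per-warp reduction, registers only

    if (lane == 0) partial[warpId] = val;  // "broadcast": one shared-memory write per warp
    __syncthreads();

    // The first warp then reduces the per-warp partial sums.
    val = (threadIdx.x < blockDim.x / 32) ? partial[lane] : 0.0f;
    if (warpId == 0) val = warpReduceSum(val);
    return val;                            // thread 0 ends up with the block sum
}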

Also, why is 64 used as the thread block size, when 256 tends to be optimal for Kepler?

Thanks.

This is really great. Happy to see a fast sort. How does it fare on Kepler? Do you think you can beat 32-bit sort too?

Have not tested on Kepler yet. The solution does rely quite a bit on atomic operations, for which Kepler should provide further improvements over Fermi.

I did implement a 32-bit sorter, but by the time of the poster submission it was “only” roughly on par with the B40C sorter (Google Code Archive - Long-term storage for Google Code Project Hosting). Will see if I can contribute anything more there.

I tested this code on a GTX 780ti and thought to post the results:

GeForce GTX 780 Ti @ 336.000 GB/s

N            [GB/s]    [perc]    [usec]    test
1048576      157.33    46.82     26.7      Pass
2097152      192.49    57.29     43.6      Pass
4194304      233.10    69.38     72.0      Pass
8388608      258.17    76.84     130.0     Pass
16777216     273.89    81.51     245.0     Pass
33554432     281.41    83.75     476.9     Pass
67108864     285.67    85.02     939.7     Pass
134217728    287.55    85.58     1867.1

Non-base 2 tests!

N            [GB/s]    [perc]    [usec]    test
14680102     272.84    81.20     215.2     Pass
14680119     272.76    81.18     215.3     Pass
18875600     270.54    80.52     279.1     Pass
7434886      165.25    49.18     180.0     Pass
13324075     247.17    73.56     215.6     Pass
15764213     257.93    76.76     244.5     Pass
1850154      65.80     19.58     112.5     Pass
4991241      148.23    44.12     134.7     Pass

Even with the infamous WDDM driver, still good numbers.

Wait, we can use atomic functions with floats now?

Re: the GTX 780 Ti in Windows → I thought there would be a big performance hit due to using the WDDM driver vs. the TCC driver in the Tesla line.

At first there is a bit more overhead with the WDDM driver, but after that I have seen no issues so far.

Not only that, but somehow my CUDA implementation of next_permutation(), which I have tested on a GTX 780 Ti in Linux, is actually FASTER in Windows 7.

For example, generating all permutations of a 13-element array took 8.19 seconds in Linux (Ubuntu 12.10), while in Windows 7 it took 6.95 seconds:

Testing raw permutation version.
GPU timing for 13!: 6.954 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu

Odd…

You have been able to use atomics on floats since compute capability 2.0. This reduction code does not use atomics at all though; I think you are looking at the wrong file.

Are you kidding me? I’ve been converting my floats to ints and… This whole time… Omg… I even had this whole topic here about how to use floats with atomicMin() and no one said that… Oh well, at least I learned how to radix sort?
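To be fair, as far as I know it is only atomicAdd() and atomicExch() that have native float overloads from compute capability 2.0 onward; atomicMin()/atomicMax() on float still need a bit-pattern trick like the one you describe. A rough sketch (my own naming, not code from this thread):

// Native hardware float atomic, available since compute capability 2.0.
__global__ void sumKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);   // no int conversion needed
}

// atomicMin() has no float overload, so reinterpret the bits: non-negative
// floats order like signed ints, negative floats order in reverse as
// unsigned ints (NaNs not handled).
__device__ float atomicMinFloat(float* addr, float value)
{
    return (value >= 0.0f)
        ? __int_as_float(atomicMin((int*)addr, __float_as_int(value)))
        : __uint_as_float(atomicMax((unsigned int*)addr, __float_as_uint(value)));
}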

I am having trouble accessing the attachments in this thread.

I guess that means that the original reduction code by JP is no longer correctly linked? If you are looking for that code you can PM me or I can post it, but I am not sure if JP took it down himself.

I, too, am interested in the source code.

Please post it or PM me how to get it. Tnx.

This was the OP’s original code.
IMO the best gauge of bandwidth.

http://pastebin.com/2pVhzeYD

Thanks for re-posting.

The GTX 980 numbers:

GeForce GTX 980 @ 224.320 GB/s

N            [GB/s]    [perc]    [usec]    test
1048576      125.25    55.84     33.5      Pass
2097152      145.45    64.84     57.7      Pass
4194304      159.49    71.10     105.2     Pass
8388608      168.04    74.91     199.7     Pass
16777216     172.79    77.03     388.4     Pass
33554432     175.20    78.10     766.1     Pass
67108864     176.43    78.65     1521.5    Pass
134217728    177.06    78.93     3032.1    Pass

Non-base 2 tests!

N            [GB/s]    [perc]    [usec]    test
14680102     172.29    76.80     340.8     Pass
14680119     172.36    76.84     340.7     Pass
18875600     171.25    76.34     440.9     Pass
7434886      123.42    55.02     241.0     Pass
13324075     161.86    72.16     329.3     Pass
15764213     166.43    74.19     378.9     Pass
1850154      58.77     26.20     125.9     Pass
4991241      114.61    51.09     174.2     Pass

The GTX 780 Ti is still the king in GB/s, both in nominal terms and as a percentage of theoretical maximum (it was about 86% of theoretical on the same test, though maybe some adjustments are needed for Maxwell).

Yeah, since they changed the forums around, the original file was removed.

Looks like it needs a little tweaking for Maxwell perhaps? :-) Will give it a spin on my GTX980 when I have the time to fiddle with it.

Thanks CudaaduC for the helpful re-post of the source.

GTX 980m

GeForce GTX 980M @ 160.320 GB/s

N            [GB/s]    [perc]    [usec]    test
1048576      96.79     60.37     43.3      Pass
2097152      112.60    70.24     74.5      Pass
4194304      128.75    80.31     130.3     Pass
8388608      136.75    85.30     245.4     Pass
16777216     141.31    88.14     474.9     Pass
33554432     143.84    89.72     933.1     Pass
67108864     144.86    90.36     1853.1
134217728    145.84    90.97     3681.3

Non-base 2 tests!

N            [GB/s]    [perc]    [usec]    test
14680102     140.85    87.85     416.9     Pass
14680119     140.79    87.82     417.1     Pass
18875600     140.11    87.39     538.9     Pass
7434886      99.97     62.35     297.5     Pass
13324075     132.60    82.71     401.9     Pass
15764213     136.11    84.90     463.3     Pass
1850154      47.10     29.38     157.1     Pass
4991241      93.11     58.08     214.4     Pass

In terms of percentage of theoretical maximum, that is much better than the desktop GTX 980.

In nominal terms it is not as good, obviously, but overall the GTX 980M is very impressive for a laptop.

90.97% :-)

Would be curious to know the difference between the chips. How wide is the memory bus?

256-bit, I believe, for the 980M.
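For what it is worth, the “@ ... GB/s” headers above are consistent with the usual theoretical-bandwidth arithmetic from the device properties (2 transfers per memory clock for GDDR5 times the bus width). A small host-side check I use, not code from the benchmark itself:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // memoryClockRate is in kHz, memoryBusWidth in bits; GDDR5 transfers twice per clock.
    double peakGBs = 2.0 * prop.memoryClockRate * 1000.0 * (prop.memoryBusWidth / 8.0) / 1.0e9;

    // e.g. GTX 780 Ti: 384-bit @ 3500 MHz -> 336.000 GB/s, matching the header above.
    printf("%s @ %.3f GB/s (%d-bit bus)\n", prop.name, peakGBs, prop.memoryBusWidth);
    return 0;
}

By that arithmetic, the 980M header of 160.32 GB/s corresponds to a 256-bit bus at roughly a 2.5 GHz memory clock (about 5 GT/s effective).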

Using the CUDA-Z utility I get 3.4 TFLOPS for 32-bit float and 1.035 TeraIops for 32-bit integers (whatever those values are worth in real-life terms).

The specifications for the GTX 980M state 1536 CUDA cores at a 1.038 GHz base clock:

[url]http://www.geforce.com/hardware/notebook-gpus/geforce-gtx-980m/specifications[/url]

This would result in 3.19 single-precision TFLOPS, but the GPU may apply a clock boost during the benchmark run (are there vendor-overclocked GPUs in the mobile field? Don’t know). I am not sure how CUDA-Z measures TeraIops. 32-bit IMAD throughput, with each IMAD counting as two operations? 32-bit IADD throughput?
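For reference, the arithmetic behind that 3.19 TFLOPS figure, using the spec numbers quoted above (one FMA per core per clock, counted as two floating-point operations):

#include <stdio.h>

int main(void)
{
    const int    cuda_cores    = 1536;   /* GTX 980M, per the spec page above */
    const double base_ghz      = 1.038;  /* base clock in GHz */
    const double flops_per_fma = 2.0;    /* multiply + add */

    /* 1536 * 1.038 GHz * 2 ~= 3.19 TFLOPS at the base clock */
    printf("Peak SP: %.2f TFLOPS\n", cuda_cores * base_ghz * flops_per_fma / 1000.0);
    return 0;
}

By the same arithmetic, the 3.4 TFLOPS reported by CUDA-Z would correspond to a clock of roughly 1.1 GHz, i.e. some boost above the base clock.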