…
When I look at the version of this reduction code which does use __shfl,
float warpSum = warp_reduce_registers<x_threads, NbXGroups>( sum ) ;
// Broadcast warp-reduced sums via shared memory
if( threadIdx.x%32 == 0 )
smem[threadIdx.y][threadIdx.x/32] = warpSum;
__syncthreads();
I am curious about the ‘Broadcast’ step of the warpSum value. I assume this is the ‘faster’ route to filling in the shared-memory values. Unlike the usual ‘warpReduce’ step, which is done when threadIdx.x < 32, this only ‘Broadcasts’ when threadIdx.x is a multiple of the warp size. So this is using registers rather than shared memory?
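To answer my own question partly, here is a minimal sketch of what such a two-level shuffle reduction typically looks like. This is the standard __shfl_down idiom, not the OP's exact code; the 1D block shape and names are my assumptions (the OP's version uses a 2D block, hence the smem[threadIdx.y][...] indexing):

```cuda
// Sketch of a two-level block reduction (assumed 1D block; not the OP's exact code).
// Level 1: each warp reduces its 32 values entirely in registers via __shfl_down,
// so no shared memory or __syncthreads() is needed within the warp.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);  // __shfl_down_sync(0xffffffff, ...) on newer CUDA
    return val;
}

__global__ void blockReduceSum(const float *in, float *out, int n)
{
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    sum = warpReduceSum(sum);  // per-warp partial, computed in registers

    // Level 2: lane 0 of each warp writes its partial to shared memory
    // (this is the 'Broadcast' step), then the first warp finishes the job.
    __shared__ float smem[32];  // at most 32 warps per block
    if (threadIdx.x % 32 == 0)
        smem[threadIdx.x / 32] = sum;
    __syncthreads();

    if (threadIdx.x < 32) {
        sum = (threadIdx.x < blockDim.x / 32) ? smem[threadIdx.x] : 0.0f;
        sum = warpReduceSum(sum);
        if (threadIdx.x == 0)
            out[blockIdx.x] = sum;  // one partial per block; reduce again on a second pass
    }
}
```

So the per-warp sum itself stays in registers via the shuffles; shared memory is only touched once per warp, to exchange the (at most 32) partial results between warps.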
Also why is 64 used as the thread block size, when for Kepler it seems 256 tends to be optimal?
Thanks.
This is really great. Happy to see a fast sort. How does it fare on Kepler? Do you think you can beat 32-bit sort too?
Have not tested on Kepler yet. The solution does rely quite a bit on atomic operations where Kepler should provide further improvements over Fermi.
I did implement a 32-bit sorter, but by the time of poster submission it was “only” roughly on par with the B40C sorter (Google Code Archive - Long-term storage for Google Code Project Hosting). Will see if I can contribute anything more there.
I tested this code on a GTX 780ti and thought to post the results:
GeForce GTX 780 Ti @ 336.000 GB/s
N [GB/s] [perc] [usec] test
1048576 157.33 46.82 26.7 Pass
2097152 192.49 57.29 43.6 Pass
4194304 233.10 69.38 72.0 Pass
8388608 258.17 76.84 130.0 Pass
16777216 273.89 81.51 245.0 Pass
33554432 281.41 83.75 476.9 Pass
67108864 285.67 85.02 939.7 Pass
134217728 287.55 85.58 1867.1
Non-base 2 tests!
N [GB/s] [perc] [usec] test
14680102 272.84 81.20 215.2 Pass
14680119 272.76 81.18 215.3 Pass
18875600 270.54 80.52 279.1 Pass
7434886 165.25 49.18 180.0 Pass
13324075 247.17 73.56 215.6 Pass
15764213 257.93 76.76 244.5 Pass
1850154 65.80 19.58 112.5 Pass
4991241 148.23 44.12 134.7 Pass
Even with the infamous WDDM driver, still good numbers.
Wait, we can use atomic functions with floats now?
Re: the GTX 780ti in Windows → I thought there would be a big performance hit due to use of the WDDM driver vs the TCC driver in the Tesla line.
At first there is a bit more overhead with the WDDM, but after that no issue so far.
Not only that, somehow my CUDA implementation of next_permutation(), which I have tested on a GTX 780ti in Linux, is actually FASTER in Windows 7.
For example, generating all permutations of a 13-element array took 8.19 seconds in Linux (Ubuntu 12.10), while in W7 it took 6.95 seconds:
Testing raw permutation version.
GPU timing for 13!: 6.954 seconds.
6227020800 permutations generated, took apx 1052366515200 iterations/calc on gpu
Odd…
You have always been able to use atomics on floats since compute capability 2.0. This reduction code does not use atomics at all, though; I think you are looking at the wrong file.
Are you kidding me? I’ve been converting my floats to ints and… This whole time… Omg… I even had this whole topic here about how to use floats with atomicMin() and no one said that… Oh well, at least I learned how to radix sort?
I am having trouble accessing the attachments in this thread.
I guess that means the original reduction code by JP is no longer correctly linked? If you are looking for that code you can PM me, or I can post it here; not sure if JP took it down himself.
I, too, am interested in the source code.
Please post it or PM me how to get it. Thanks.
This was the OP’s original code.
IMO the best gauge of bandwidth.
Thanks for re-posting.
The GTX 980 numbers:
GeForce GTX 980 @ 224.320 GB/s
N [GB/s] [perc] [usec] test
1048576 125.25 55.84 33.5 Pass
2097152 145.45 64.84 57.7 Pass
4194304 159.49 71.10 105.2 Pass
8388608 168.04 74.91 199.7 Pass
16777216 172.79 77.03 388.4 Pass
33554432 175.20 78.10 766.1 Pass
67108864 176.43 78.65 1521.5 Pass
134217728 177.06 78.93 3032.1 Pass
Non-base 2 tests!
N [GB/s] [perc] [usec] test
14680102 172.29 76.80 340.8 Pass
14680119 172.36 76.84 340.7 Pass
18875600 171.25 76.34 440.9 Pass
7434886 123.42 55.02 241.0 Pass
13324075 161.86 72.16 329.3 Pass
15764213 166.43 74.19 378.9 Pass
1850154 58.77 26.20 125.9 Pass
4991241 114.61 51.09 174.2 Pass
Press any key to continue . . .
The GTX 780ti is still the king in GB/s, both in nominal terms and as a percentage of theoretical maximum (it was about 86% of theoretical on the same test, though maybe there are some adjustments needed for Maxwell).
Yeah since they changed the forums around the original file was removed.
Looks like it needs a little tweaking for Maxwell perhaps? :-) Will give it a spin on my GTX980 when I have the time to fiddle with it.
Thanks CudaaduC for the helpful re-post of the source.
GTX 980m
GeForce GTX 980M @ 160.320 GB/s
N [GB/s] [perc] [usec] test
1048576 96.79 60.37 43.3 Pass
2097152 112.60 70.24 74.5 Pass
4194304 128.75 80.31 130.3 Pass
8388608 136.75 85.30 245.4 Pass
16777216 141.31 88.14 474.9 Pass
33554432 143.84 89.72 933.1 Pass
67108864 144.86 90.36 1853.1
134217728 145.84 90.97 3681.3
Non-base 2 tests!
N [GB/s] [perc] [usec] test
14680102 140.85 87.85 416.9 Pass
14680119 140.79 87.82 417.1 Pass
18875600 140.11 87.39 538.9 Pass
7434886 99.97 62.35 297.5 Pass
13324075 132.60 82.71 401.9 Pass
15764213 136.11 84.90 463.3 Pass
1850154 47.10 29.38 157.1 Pass
4991241 93.11 58.08 214.4 Pass
In terms of percentage of theoretical maximum, much better than the desktop version GTX 980.
In nominal terms not as good obviously, but overall the GTX 980m is very impressive for a laptop.
90.97% :-)
Would be curious to know the difference between the chips. How wide is memory bus?
256-bit, I believe, for the 980m.
Using the Cuda-Z utility I get 3.4 Teraflops for 32 bit float and 1.035 TeraIops for 32 bit integers (whatever those values are worth in real-life terms).
The specifications for the GTX 980M states 1536 CUDA cores at 1.038 GHz base clock:
[url]http://www.geforce.com/hardware/notebook-gpus/geforce-gtx-980m/specifications[/url]
This would result in 3.19 single-precision TFLOPS, but the GPU may apply clock boost during the benchmark run (are there vendor overclocked GPUs in the mobile field? Don’t know). I am not sure how CUDA-Z measures TeraIops. 32-bit IMAD throughput with each IMAD counting as two operations? 32-bit IADD throughput?