Speedy general reduction sum code (~88.5% of peak). Updated for Kepler! __shfl(), etc.

I checked the actual clock and you are right, there must be a boost applied, because the same utility shows the clock as 1.1265 GHz.

This is the exact laptop which I used to run the test:

http://www.newegg.com/Product/Product.aspx?Item=N82E16834232199&cm_re=gtx_980m--34-232-199--Product

Overall so far I am very happy with this laptop, though I find the Metro GUI for Windows 8.1 to be annoying.

I looked up the source code for CUDA-Z in its SVN repository and 32-bit Iops performance does indeed seem to be based on 32-bit IMAD throughput, with each IMAD counting as two integer operations. Based on a quick compile/disassembly this instruction seems to be emulated by a three-instruction sequence on Maxwell (sm_50), so we would expect 1.063 Tiops at the base clock of 1.038 GHz for the GTX 980M.
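For what it's worth, the back-of-the-envelope arithmetic behind that estimate looks roughly like this (assuming the GTX 980M's 1536 CUDA cores; the 2/3 factor is 2 integer ops per IMAD divided by 3 instructions per emulated IMAD):

    // Rough host-side estimate of CUDA-Z's 32-bit Iops figure for the GTX 980M.
    // Assumed: 1536 CUDA cores, 1.038 GHz base clock, IMAD counted as 2 int ops
    // but emulated by a 3-instruction sequence on sm_50.
    #include <stdio.h>

    int main(void)
    {
        const double cores    = 1536.0;
        const double clock_hz = 1.038e9;
        const double tiops    = cores * clock_hz * (2.0 / 3.0) * 1e-12;
        printf("expected: %.3f Tiops\n", tiops);   // ~1.063 Tiops
        return 0;
    }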

So modulo some GPU clock boosting and some less than perfect efficiency vs theoretical, the numbers reported by CUDA-Z seem entirely plausible.

Wow, that is a beast. RAID SSDs in a laptop. Cuz, you know, one SSD is just not fast enough. I guess Windows finishes booting right about when you release the power button…

I need to have a talk with Santa Claus…

Yeah, I’ve noticed that it’s generally easier to achieve high bandwidth utilization on low-clocked, wide memory interfaces than on high-clocked, narrow ones.

For the GTX 980 you have 256-bit and 224 GB/s.
And for the GTX 980M you have 256-bit and 160.3 GB/s.
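As a sanity check, peak bandwidth is just the bus width in bytes times the effective memory data rate; a quick calculation (the 7.0 GT/s and 5.0 GT/s data rates are my own assumptions, not numbers from this thread) gets you close to both figures:

    // Theoretical peak bandwidth = (bus width / 8) bytes * effective data rate.
    #include <stdio.h>

    int main(void)
    {
        const double bus_bytes = 256.0 / 8.0;              // 32 bytes per transfer
        printf("GTX 980 : %.1f GB/s\n", bus_bytes * 7.0);  // 224.0 GB/s
        printf("GTX 980M: %.1f GB/s\n", bus_bytes * 5.0);  // 160.0 GB/s (quoted above as 160.3)
        return 0;
    }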

I can’t wait for the arrival of HBM memories, it’s going to change everything :-)

You may be familiar with the old saying: “The more things change, the more they stay the same”.

There are dual challenges in the high-performance computing field that will not go away: the “memory wall” and the “power wall”.

While Moore’s Law is on its last legs (this is not just a question of technology but also of economics), it is not dead yet, so FLOPS will keep growing for several more years. New memory technologies will be needed just to keep the ratio of FLOPS to bandwidth from becoming even more unfavorable than it is today. In addition, memory accesses are expensive in terms of energy compared to computation. Again, new memory technologies that shorten wire lengths help curb that energy hunger (and, by extension, power) but will not eliminate the fundamental problem.

From a software perspective, I expect the need to reduce data movement, and to trade memory accesses for additional computation where possible, will continue.
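As a toy illustration of that trade-off (a hypothetical example of mine, nothing from this thread): instead of streaming a precomputed coefficient table from global memory, a kernel can often just recompute the coefficient per element, spending FLOPS to save bandwidth:

    // Bandwidth-heavier version: two global loads plus one store per element.
    __global__ void scale_lookup(const float *in, const float *coeff, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * coeff[i];
    }

    // Compute-heavier version: recomputes the coefficient, halving the global loads.
    __global__ void scale_recompute(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * __expf(-0.5f * (float)i);   // extra math instead of a load
    }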

Yes, I saw a very interesting talk by Bill Dally a few years ago where he explained how an FMA cost ~7 picojoules while moving 8 bytes of data 10 mm cost ~200 pJ.

From what I understand, HBM might offer up to 5-10x less energy per byte, which would carry us a little bit further into the future.

After 2020 though, maybe we’ll start to see processors with more exotic materials appearing.

I am not so upbeat about new processing technology, especially from the economic perspective. From what I understand, if one were to build a fab from scratch for 14 nm production today, the cost would be about US$ 10B. What entities can afford such an investment? Once the fab has been built, what products can be made in it in sufficient quantity to recoup the investment (plus profit)? Companies that own fabs are already cutting costs by re-using existing shells, upgrading existing product lines instead of building new ones, etc., but that can take us only so far.

The history of technology provides many examples of exponential growth in the early phases of new technology (think steam engines, railways, automobiles) followed by long phases of low growth. I think we will hit that inflection point in the semiconductor industry fairly soon. Maybe this will lead to the ascendancy of software engineers, trying to squeeze ever more performance out of limited hardware?

Well, Intel can afford it right now, but as costs grow exponentially, other fabs may start to approach them on process node. Even if it stays too costly for them to reach the very same node, the gap will become small enough that the cost of staying ahead outweighs the competitive benefit. Then it will be more about hardware architecture and good programming…

For example, Nvidia managed to improve performance/watt by nearly 50% going from Kepler to Maxwell on the same process node.

10 BN USD is not far from the cost of developing the Airbus A380.

A stock (1076 MHz clock) Nvidia Titan X with the same code, no modifications:

GeForce GTX TITAN X @ 336.480 GB/s

 N               [GB/s]          [perc]          [usec]          test
 1048576         158.08          46.98           26.5            Pass
 2097152         200.91          59.71           41.8            Pass
 4194304         227.89          67.73           73.6            Pass
 8388608         256.18          76.14           131.0           Pass
 16777216        270.37          80.35           248.2           Pass
 33554432        277.86          82.58           483.0           Pass
 67108864        282.16          83.86           951.4           Pass
 134217728       284.32          84.50           1888.3          Pass

 Non-base 2 tests!

 N               [GB/s]          [perc]          [usec]          test
 14680102        269.55          80.11           217.8           Pass
 14680119        269.50          80.09           217.9           Pass
 18875600        266.72          79.27           283.1           Pass
 7434886         155.26          46.14           191.5           Pass
 13324075        240.78          71.56           221.4           Pass
 15764213        253.07          75.21           249.2           Pass
 1850154         60.10           17.86           123.1           Pass
 4991241         139.95          41.59           142.7           Pass
Press any key to continue . . .

About the same as the Kepler Titan, and still (by a very slight margin) the GTX 780 Ti is the champ (at least with this code, which was intended for Kepler).

I wonder if there are any modifications that would improve it for Maxwell?

From the device query. Notice the L2 cache size and 2 copy engines:

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          7.0 / 6.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 12288 MBytes (12884901888 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Clock rate:                                1076 MHz (1.08 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

I’ve been looking for the file my_reduction.cu, but it seems it has been deleted.

I’m working on a reduction and have implemented one with __shfl_down() instructions. The thing is that it works fine on Kepler cards, but now I’m working on Fermi. I’m looking for general code that works on both Fermi and Kepler.

Here you go:

http://pastebin.com/s04Y1xmA
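In case the link ever goes stale: this is not the pastebin code, just a rough sketch of the usual way to keep one reduction working on both Fermi (no __shfl) and Kepler, switching on __CUDA_ARCH__ between a warp-synchronous shared-memory reduction and __shfl_down():

    // Sketch of a block-level sum reduction for sm_20 (Fermi) and sm_30+ (Kepler).
    __device__ __forceinline__ float warp_reduce_sum(float val, volatile float *smem)
    {
    #if __CUDA_ARCH__ >= 300
        (void)smem;                                  // shuffle path needs no shared memory
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down(val, offset);
    #else
        int tid = threadIdx.x;                       // warp-synchronous shared-memory path
        smem[tid] = val;
        if ((tid & 31) < 16) smem[tid] = val = val + smem[tid + 16];
        if ((tid & 31) <  8) smem[tid] = val = val + smem[tid +  8];
        if ((tid & 31) <  4) smem[tid] = val = val + smem[tid +  4];
        if ((tid & 31) <  2) smem[tid] = val = val + smem[tid +  2];
        if ((tid & 31) <  1) smem[tid] = val = val + smem[tid +  1];
    #endif
        return val;
    }

    // Assumes blockDim.x == 256 and that *out is zeroed before the launch.
    __global__ void reduce_sum(const float *in, float *out, int n)
    {
        __shared__ float smem[256];
        __shared__ float warp_sums[256 / 32];

        float sum = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)            // grid-stride loop, any n works
            sum += in[i];

        sum = warp_reduce_sum(sum, smem);            // per-warp partial sums
        if ((threadIdx.x & 31) == 0)
            warp_sums[threadIdx.x >> 5] = sum;
        __syncthreads();

        if (threadIdx.x < 32)                        // first warp folds the 8 partials
        {
            sum = (threadIdx.x < 256 / 32) ? warp_sums[threadIdx.x] : 0.0f;
            sum = warp_reduce_sum(sum, smem);
            if (threadIdx.x == 0)
                atomicAdd(out, sum);                 // one atomic per block
        }
    }

Note that the Fermi branch relies on the old warp-synchronous volatile trick, which is fine for sm_2x/sm_3x-era code like this; on current architectures you would use __shfl_down_sync() instead.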

Result on an EVGA GTX Titan X with both clocks set to ‘max_supported’, without overclocking:

GeForce GTX TITAN X @ 336.480 GB/s

 N               [GB/s]          [perc]          [usec]          test
 1048576         185.39          55.10           22.6            Pass
 2097152         229.07          68.08           36.6            Pass
 4194304         257.47          76.52           65.2            Pass
 8388608         279.53          83.08           120.0           Pass
 16777216        291.53          86.64           230.2           Pass
 33554432        298.22          88.63           450.1           Pass
 67108864        301.60          89.63           890.0           Pass
 134217728       303.35          90.15           1769.8          Pass

 Non-base 2 tests!

 N               [GB/s]          [perc]          [usec]          test
 14680102        289.38          86.00           202.9           Pass
 14680119        289.28          85.97           203.0           Pass
 18875600        285.88          84.96           264.1           Pass
 7434886         172.49          51.26           172.4           Pass
 13324075        261.00          77.57           204.2           Pass
 15764213        272.64          81.03           231.3           Pass
 1850154         68.14           20.25           108.6           Pass
 4991241         155.74          46.29           128.2           Pass
Press any key to continue . . .

Can I ask how this can approach 90% of peak using only “add” operations? I thought peak was add+mul performance, so shouldn’t it reach only 45-50%? Is it the “integer + floating point” dual-pipeline trick, or some extra special-function usage?

These are percentages of peak memory bandwidth, not of arithmetic throughput.
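For example, taking the last base-2 row of the Titan X run above (134217728 elements in 1769.8 usec) and assuming the benchmark counts one 4-byte read per element:

    #include <stdio.h>

    int main(void)
    {
        const double bytes   = 134217728.0 * 4.0;   // one float read per element (assumed)
        const double seconds = 1769.8e-6;           // reported time
        const double gbps    = bytes / seconds * 1e-9;
        printf("%.2f GB/s = %.2f%% of 336.480 GB/s\n",
               gbps, 100.0 * gbps / 336.480);       // ~303.35 GB/s, ~90.15%, matching the table
        return 0;
    }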