I looked up the source code for CUDA-Z in its SVN repository, and 32-bit Iops performance does indeed seem to be based on 32-bit IMAD throughput, with each IMAD counting as two integer operations. Based on a quick compile/disassembly, this instruction seems to be emulated by a three-instruction sequence on Maxwell (sm_50), so we would expect 1.063 Tiops at a base clock of 1.038 GHz for the GTX 980M.
So, modulo some GPU clock boosting and some less-than-perfect efficiency vs. theoretical, the numbers reported by CUDA-Z seem entirely plausible.
Wow, that is a beast. RAID SSDs in a laptop. Cuz, you know, one SSD is just not fast enough. I guess Windows finishes booting right about when you release the power button…
Yeah, I’ve noticed that it’s generally easier to achieve higher bandwidth utilization on low-clocked, wide memory interfaces vs. high-clocked, narrow interfaces.
For the GTX 980 you have 256-bit and 224 GB/s.
And for the 980M you have 256-bit and 160.3 GB/s.
I can’t wait for the arrival of HBM memories, it’s going to change everything :-)
You may be familiar with the old saying: “The more things change, the more they stay the same”.
There are dual challenges in the high-performance computing field that will not go away: the “memory wall” and the “power wall”.
While Moore’s Law is on its last legs (this is not just a question of technology but also of economics), it is not dead yet, so growth of FLOPS will continue for several more years. New memory technologies will be required just to keep the ratio of FLOPS to bandwidth from becoming even more unfavorable than it is today. In addition, memory accesses are expensive in terms of energy compared to computation. Again, new memory technologies that shorten wire lengths help curb that energy hunger (and by extension, power) but will not eliminate the fundamental problem.
From a software perspective, I expect that the need to reduce data movement and trade memory accesses against additional computation where possible will continue.
Yes, I saw a very interesting talk by Bill Dally a few years ago where he explained how an FMA cost ~7 picojoules while moving 8 bytes of data 10 mm cost ~200 pJ.
From what I understand, HBM might offer up to 5-10x less energy per byte, which could carry us a little further into the future.
After 2020 though, maybe we’ll start to see processors with more exotic materials appearing.
I am not so upbeat about new processing technology, especially from the economic perspective. From what I understand, if one were to build a fab from scratch for 14 nm production today, the cost would be about US$ 10B. What entities can afford such an investment? Once the fab has been built, what products can be made in it in sufficient quantity to recoup the investment (plus profit)? Companies that own fabs are already cutting costs by re-using existing shells, upgrading existing product lines instead of building new ones, etc., but that can take us only so far.
The history of technology provides many examples of exponential growth in the early phases of new technology (think steam engines, railways, automobiles) followed by long phases of low growth. I think we will hit that inflection point in the semiconductor industry fairly soon. Maybe this will lead to the ascendancy of software engineers, trying to squeeze ever more performance out of limited hardware?
Well, Intel can afford it right now, but as costs grow exponentially, other fabs may begin to approach Intel on process node. Even if it remains too costly for them to reach the very same node, the gap will become small enough that the competitive benefit no longer justifies the cost. Then it will be more about hardware architecture and good programming…
For example, Nvidia managed to improve the performance/watt of Maxwell over Kepler by nearly 50% on the same process node.
US$ 10B is not far from the development cost of the Airbus A380.
From the device query. Notice the L2 cache size and 2 copy engines:
Device 0: "GeForce GTX TITAN X"
CUDA Driver Version / Runtime Version 7.0 / 6.5
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 12288 MBytes (12884901888 bytes)
(24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
GPU Clock rate: 1076 MHz (1.08 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
I’ve been looking for the file my_reduction.cu, but it seems it has been deleted.
I’m working on a reduction and I have implemented one with __shfl_down instructions. The thing is that it works fine on Kepler cards, but now I’m working on Fermi. I’m looking for general code that can work on both Fermi and Kepler.
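Not a replacement for the missing my_reduction.cu, but one common way to make a block reduction build for both architectures is to guard the shuffle path on __CUDA_ARCH__: warp shuffle needs compute capability 3.0 (Kepler), so the Fermi build falls back to a shared-memory tree. A sketch, assuming 256 threads per block:

```cuda
// Sketch of a block sum reduction that compiles for Fermi (sm_2x) and
// Kepler (sm_3x). __shfl_down requires compute capability >= 3.0, so on
// Fermi we use a classic shared-memory tree instead. Assumes blockDim.x = 256.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float smem[256];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;
    float v = (i < n) ? in[i] : 0.0f;

#if __CUDA_ARCH__ >= 300
    // Kepler path: reduce each warp with shuffles.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down(v, offset);
    if ((tid & 31) == 0) smem[tid >> 5] = v;   // one partial sum per warp
    __syncthreads();
    if (tid < 32) {                            // warp 0 reduces the partials
        v = (tid < (int)blockDim.x / 32) ? smem[tid] : 0.0f;
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            v += __shfl_down(v, offset);
    }
#else
    // Fermi path: shared-memory tree reduction.
    smem[tid] = v;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) smem[tid] += smem[tid + s];
        __syncthreads();
    }
    v = smem[0];
#endif
    if (tid == 0) out[blockIdx.x] = v;
}
```

This writes one partial sum per block; reduce the per-block results with a second kernel launch or on the host.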
Can I ask how this can approach 90% of peak using only “add” operations? I thought peak was add+mul performance, so shouldn’t it get only 45-50%? Is it the “integer + floating point” dual-pipeline trick, or some extra special-function usage?