Unofficial Kepler Slides from Random Gamer Site

Yeah, yeah, but we only have another week to rumor-monger…

Nice try, but we can’t comment on unreleased products…

mfatica,

Congratulations on the release of the new architecture!

Kepler is a released product at this time, correct? Why don’t you publish CUDA-related performance comparisons between Kepler and Fermi? We see a lot of graphics performance numbers around the Internet, but I couldn’t find much that is CUDA-related.

Thanks!

Wow, I just noticed the line in the whitepaper on atomic operations. They were already really fast on Fermi, but now they are between 3x and 11.7x faster on Kepler??

The latter, I think, just made the idea of work queues for threads way more practical…

I don’t understand the atomic “shared” vs. “independent” address distinction - can someone explain?

IIRC, that’s the difference between all threads in a warp accessing the same address versus each thread accessing a different address.
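To make the distinction concrete, here is a minimal sketch of the two cases (the kernel names are just mine, for illustration):

[indent][font=“Courier New”]
// "Shared address": all 32 lanes of a warp contend on one word.
__global__ void sharedAddress(int *counter)
{
    atomicAdd(counter, 1);                    // worst-case serialization
}

// "Independent addresses": each lane updates its own word.
__global__ void independentAddresses(int *counters)
{
    atomicAdd(&counters[threadIdx.x], 1);     // no two lanes hit the same word
}
[/font][/indent]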

More on on-chip data:

“The shared data bandwidth for the Kepler core is 0.33B/FLOP with 32-bit accesses, just half of GF104. But the standard for general purpose workloads is not GF104. Fermi has 3× the shared data bandwidth (1B/FLOP) compared to Kepler. In comparison, AMD’s GCN has 1.5B/FLOP, demonstrating the advantages of a separate L1 data cache and local data share (LDS). The significant regression in communication bandwidth is one of the clearest signs that Nvidia has backed away from compute workloads in favor of graphics for Kepler. Note that using 64-bit accesses, the shared data bandwidth is actually 256B/cycle, which works out to 0.66B/FLOP (hence the asterisk in Table 1). However, existing CUDA programs are almost exclusively written with 32-bit accesses because earlier designs were fairly slow for 64-bit accesses.” - David Kanter, Impressions of Kepler

Kanter also seems pretty clear that GPGPU and GPU graphics are on divergent paths, with GK104 going down the same line as GF104.

Since the gamer sites don’t do much CUDA benchmarking, maybe a direct, quantifiable question to ask is “How fast is OptiX on the GTX680 versus the GTX580?”

I am betting the new sm_30 PTX [font=“Courier New”]shfl[/font] opcode will become a favorite optimization for passing registers between warp lanes without touching shared memory. It was documented in one of the RC 4.1 PDFs and then expunged. Oops!

Assuming it exists, I already have places in my kernels where I know I can use it.

I missed this. What is shfl supposed to do?

Probably what SSSE3’s _mm_shuffle_epi8() does: it lets a value from any lane in a warp go to any other lane. You can use it to implement small, parallel lookup tables without going to shared memory.

Exactly. It’s good for warp-sized scans, 32-value sorting networks, etc.
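To make it concrete, here is roughly what a warp-wide inclusive scan would look like, assuming the opcode gets exposed as a [font=“Courier New”]__shfl_up()[/font] intrinsic (the name is my guess; nothing official yet):

[indent][font=“Courier New”]
// Sketch only: a Kogge-Stone inclusive scan across one warp, assuming a
// __shfl_up(value, delta) intrinsic that reads a register from lane - delta.
__device__ int warpInclusiveScan(int x)
{
    const int lane = threadIdx.x & 31;        // lane index within the warp
    for (int offset = 1; offset < 32; offset <<= 1)
    {
        int y = __shfl_up(x, offset);         // fetch x from lane - offset
        if (lane >= offset)                   // low lanes have no source lane
            x += y;
    }
    return x;                                 // lane i now holds sum of lanes 0..i
}
[/font][/indent]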

I assume, if it exists, that it will be faster than bouncing data through shared mem.

Squeezing more work into registers is what makes CUDA fun (?).

Agreed. I felt really smug when I wrote a 1D convolution that is fully unrolled and needs only one register load per new output element - massive register reuse between adjacent output elements. Best of all, it works for any filter size; the example in the CUDA SDK has hard-coded filter sizes.
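Not claiming this is the same code, but the register-reuse trick looks roughly like this (a sketch only; the filter size is a compile-time constant here for brevity, and [font=“Courier New”]c_filter[/font] is assumed to be filled via cudaMemcpyToSymbol):

[indent][font=“Courier New”]
#define FILTER_SIZE 5                         // compile-time for this sketch
#define OUT_PER_THREAD 4

__constant__ float c_filter[FILTER_SIZE];     // filled with cudaMemcpyToSymbol

__global__ void conv1d(const float *in, float *out, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * OUT_PER_THREAD;

    // Prime a register window with the first FILTER_SIZE - 1 inputs.
    float w[FILTER_SIZE];
    #pragma unroll
    for (int i = 0; i < FILTER_SIZE - 1; ++i)
        w[i] = (base + i < n) ? in[base + i] : 0.0f;

    #pragma unroll
    for (int o = 0; o < OUT_PER_THREAD; ++o)
    {
        int idx = base + o + FILTER_SIZE - 1; // exactly one new load per output
        w[FILTER_SIZE - 1] = (idx < n) ? in[idx] : 0.0f;

        float acc = 0.0f;
        #pragma unroll
        for (int k = 0; k < FILTER_SIZE; ++k)
            acc += w[k] * c_filter[k];
        if (base + o < n)
            out[base + o] = acc;

        #pragma unroll                        // shift the window by one element
        for (int k = 0; k < FILTER_SIZE - 1; ++k)
            w[k] = w[k + 1];
    }
}
[/font][/indent]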

I’m looking forward to hierarchical register files that will increase efficiency even further. I wonder how disruptive a change that will be. From what I’ve read, it should be transparent to the programmer, since the local operand registers are in the same namespace as the global registers, so there is no need for different instructions.

The new CUDA 4.2 Toolkit posted here:

includes the updated Programming Guide (Chapters 4 and 5 and Appendix F), which confirms some things and notes some other unexpected attributes of compute capability 3.0:

  • The execution time of an instruction (for those with maximum throughput, anyway) is 11 clock cycles, down from 22 on Fermi.

  • Throughputs of 32-bit operations are no longer identical. Max throughput per clock on an SMX is 192 floating-point multiply-adds, 168 integer adds, or 136 logical operations; these all had the same throughput in compute capability 2.0.

  • Relative to the throughput of single precision multiply-add, the throughput of integer shifts, integer comparison, and integer multiplication is lower than before.

  • The throughput of the intrinsic special functions, relative to single precision floating-point MAD, is slightly higher than in compute capability 2.0.

  • Max x dimension for a grid has been raised to 2^31 - 1.

  • Max # of blocks (16), warps (64), and threads (2048) per SMX have been raised.

  • Max size of a 3D texture doubled in each dimension.

  • Max # of textures bound to kernel doubled to 256.

  • Max # of surfaces bound to kernel doubled to 16.

  • New option to select 32 kB of shared memory and 32 kB of L1, in addition to the previous 16/48 and 48/16 splits (see the sketch after this list).

  • As mentioned elsewhere, each shared memory bank can deliver two 32-bit words per clock cycle. A two-word stride between threads no longer produces bank conflicts.
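On the new 32/32 split mentioned above: a minimal sketch of selecting it per kernel, assuming the runtime exposes it through the new cudaFuncCachePreferEqual value (that’s what the 4.2 headers appear to call it):

[indent][font=“Courier New”]
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *data)         // hypothetical kernel
{
    data[threadIdx.x] *= 2.0f;
}

int main()
{
    // Ask for the equal 32 kB shared / 32 kB L1 split for this kernel.
    cudaError_t err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);
    if (err != cudaSuccess)
        printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));
    return 0;
}
[/font][/indent]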

The PTX manual adds three more lines to the list of new features:

  • Support for sm_30 architectures.

  • SIMD video instructions

  • Warp shuffle

Do you know what warp shuffle means?

Although I haven’t read it too thoroughly yet, the new (4.2, dated March 9, 2012) C Programming Guide states that the maximum number of threads per SMX is 2048; with 65536 registers per SMX, that works out to a minimum of 32 registers per thread at full occupancy (!). Fermi, for comparison, had at most 1536 threads using 32768 registers, which works out to about 21 registers per thread at maximum occupancy, so it looks like we’re actually getting some register relief here. Can someone from NVIDIA tell us what the maximum number of registers per thread is now? I believe it was 128 (or 127) for Tesla and 63 for Fermi; I’m curious to see how Kepler compares.

Those new video instructions will definitely benefit the machine-vision applications I’m using on my robot. The only unfortunate thing is that there are no packed multiply instructions, which I know are used extensively in some detection algorithms. But that’s not surprising, since adding extra multipliers just to handle packed multiplies isn’t worth the cost, and making reconfigurable, multiprecision multipliers isn’t easy (or is it?).
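The one I’d reach for first in vision code is the byte-wise sum of absolute differences. Going by the PTX 3.0 manual, an inline-asm wrapper for sm_30 would look something like this (the wrapper name and operand order are mine):

[indent][font=“Courier New”]
// Sketch: accumulate per-byte |a - b| into acc with one vabsdiff4 instruction.
__device__ unsigned int sad4(unsigned int a, unsigned int b, unsigned int acc)
{
    unsigned int d;
    asm("vabsdiff4.u32.u32.u32.add %0, %1, %2, %3;"
        : "=r"(d) : "r"(a), "r"(b), "r"(acc));
    return d;                                 // acc + sum of 4 byte differences
}
[/font][/indent]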

I’m also wondering why GK104 uses separate single- and double-precision units. I got the impression Fermi used reconfigurable multiply-add units that could handle both single and double precision. That seemed like a win.

The faster atomic operations will also be useful for image processing if they can speed up histograms.
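The obvious beneficiary is the naive per-pixel atomic histogram, which today lives or dies by atomic throughput. A minimal sketch (bins assumed zeroed beforehand):

[indent][font=“Courier New”]
// Sketch: 256-bin histogram, one global atomicAdd per pixel.
__global__ void histogram256(const unsigned char *image, int n,
                             unsigned int *bins)   // bins pre-zeroed by caller
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)                     // grid-stride loop
        atomicAdd(&bins[image[i]], 1u);            // contended on popular values
}
[/font][/indent]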

Great job, NVIDIA.

A win for fast DP compute, yes. But those capabilities sit unused on 95% of all the Fermi cards NVIDIA made, since they’re used for playing games, not doing compute. GK104’s design capitalizes on that market, spending die space on more but smaller FP cores instead of bigger FP/DP cores.

So the interesting, very interesting, very very interesting question is how GK110’s architecture differs from GK104’s. It can’t just be GK104 scaled up, since the DP throughput would not be enough. So it seems reasonable to speculate that the FP cores themselves are different on GK110, likely more like Fermi’s. Pure speculation, of course.

It is still 63 with the latest nvcc. A max register count of 64 draws a warning while 63 works (like Fermi):

[indent]

[font=“Courier New”]> nvcc -arch=sm_30 -Xptxas=-v -maxrregcount=64 test.cu[/font]

[font=“Courier New”]test.cu[/font]

[font=“Courier New”]ptxas warning : Too big maxrregcount value specified 64, will be ignored[/font]

[font=“Courier New”]ptxas info : Compiling entry function ‘_Z6kernelPj’ for ‘sm_30’[/font]

[/indent]

Can anybody please upload just the documentation?
By the way, one more question: how can I obtain the maximum number of blocks per multiprocessor from the CUDA 4.0 [font=“Courier New”]cudaGetDeviceProperties()[/font] function?

I probably can, but I assume I’m not allowed to. You don’t need to install CUDA 4.2 to get the documents, though: you can extract the files using the [font=“Courier New”]--tar[/font] option to the installation script.
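On the cudaGetDeviceProperties() question above: as far as I can tell, cudaDeviceProp has no blocks-per-multiprocessor field in CUDA 4.x, so you have to infer that limit from the compute capability (8 blocks for 1.x and 2.x, 16 for 3.0 per the new guide). What you can query directly looks like this:

[indent][font=“Courier New”]
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // query device 0

    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
[/font][/indent]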

I find it quite odd that with CUDA 4.2, old compute capability 1.x devices have increased their integer addition throughput from 8 to 10 ops/cycle - that’s probably just a typo in Table 5-1.

Like allanmac and Uncle Joe, I love the [font=“Courier New”]shfl[/font] instruction. Just a few weeks ago I wrote some code where it would have been really helpful, and I remember thinking it shouldn’t be too difficult to implement in new hardware.