Unofficial Kepler Slides from Random Gamer Site

Yeah, yeah, but we only have another week to rumor-monger…

Nice try, but we can’t comment on unreleased products…

mfatica,

Congratulations on the release of the new architecture!

Kepler is a released product at this time, correct? Why don’t you publish CUDA-related performance comparisons between Kepler and Fermi? We see a lot of graphics performance numbers around the Internet, but I couldn’t find much that is CUDA-related.

Thanks!

Wow, I just noticed the line in the whitepaper on atomic operations. They were already really fast on Fermi, but now they are between 3x and 11.7x faster on Kepler??

The latter, I think, just made the idea of work queues for threads way more practical…

I don’t understand the atomic “shared” vs. “independent” address distinction - can someone explain?

IIRC, that’s the difference between all threads in a warp accessing the same address versus each thread accessing a different address.
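To make the distinction concrete, here is a minimal sketch of the two cases (the kernel names are just mine, for illustration):

[indent][font=“Courier New”]
// "Shared address": all 32 lanes of a warp contend on one word.
__global__ void sharedAddress(int *counter)
{
    atomicAdd(counter, 1);                    // worst-case serialization
}

// "Independent addresses": each lane updates its own word.
__global__ void independentAddresses(int *counters)
{
    atomicAdd(&counters[threadIdx.x], 1);     // no two lanes hit the same word
}
[/font][/indent]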

More on on-chip data:

“The shared data bandwidth for the Kepler core is 0.33B/FLOP with 32-bit accesses, just half of GF104. But the standard for general purpose workloads is not GF104. Fermi has 3× the shared data bandwidth (1B/FLOP) compared to Kepler. In comparison, AMD’s GCN has 1.5B/FLOP, demonstrating the advantages of a separate L1 data cache and local data share (LDS). The significant regression in communication bandwidth is one of the clearest signs that Nvidia has backed away from compute workloads in favor of graphics for Kepler. Note that using 64-bit accesses, the shared data bandwidth is actually 256B/cycle, which works out to 0.66B/FLOP (hence the asterisk in Table 1). However, existing CUDA programs are almost exclusively written with 32-bit accesses because earlier designs were fairly slow for 64-bit accesses.” - David Kanter, Impressions of Kepler

Kanter also seems pretty clear that GPGPU and GPU graphics are on divergent paths, with GK104 going down the same line as GF104.

Since the gamer sites don’t do much CUDA benchmarking, maybe a direct, quantifiable question to ask is “How fast is OptiX on the GTX680 versus the GTX580?”

I am betting the new sm_30 PTX [font=“Courier New”]shfl[/font] opcode will become a favorite optimization for passing registers between warp lanes without touching shared memory. It was documented in one of the RC 4.1 PDFs and then expunged. Oops!

Assuming it exists, I already have places in my kernels where I know I can use it.

I missed this. What is shfl supposed to do?

Probably what SSSE3’s _mm_shuffle_epi8() does: it lets a value from any lane in a warp go to any other lane. You can use it to implement small, parallel lookup tables without going to shared memory.

Exactly. It’s good for warp-sized scans, 32-value sorting networks, etc.
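To make it concrete, here is roughly what a warp-wide inclusive scan would look like, assuming the opcode gets exposed as a [font=“Courier New”]__shfl_up()[/font] intrinsic (the name is my guess; nothing official yet):

[indent][font=“Courier New”]
// Sketch only: a Kogge-Stone inclusive scan across one warp, assuming a
// __shfl_up(value, delta) intrinsic that reads a register from lane - delta.
__device__ int warpInclusiveScan(int x)
{
    const int lane = threadIdx.x & 31;        // lane index within the warp
    for (int offset = 1; offset < 32; offset <<= 1)
    {
        int y = __shfl_up(x, offset);         // fetch x from lane - offset
        if (lane >= offset)                   // low lanes have no source lane
            x += y;
    }
    return x;                                 // lane i now holds sum of lanes 0..i
}
[/font][/indent]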

I assume, if it exists, that it will be faster than bouncing data through shared mem.

Squeezing more work into registers is what makes CUDA fun (?).

Agreed. I felt really smug when I wrote a 1D convolution that is fully unrolled and needs only one register load per new output element - massive register reuse between adjacent output elements. Best of all, it works for any filter size; the example in the CUDA SDK has hard-coded filter sizes.
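Not claiming this is the same code, but the register-reuse trick looks roughly like this (a sketch only; the filter size is a compile-time constant here for brevity, and [font=“Courier New”]c_filter[/font] is assumed to be filled via cudaMemcpyToSymbol):

[indent][font=“Courier New”]
#define FILTER_SIZE 5                         // compile-time for this sketch
#define OUT_PER_THREAD 4

__constant__ float c_filter[FILTER_SIZE];     // filled with cudaMemcpyToSymbol

__global__ void conv1d(const float *in, float *out, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * OUT_PER_THREAD;

    // Prime a register window with the first FILTER_SIZE - 1 inputs.
    float w[FILTER_SIZE];
    #pragma unroll
    for (int i = 0; i < FILTER_SIZE - 1; ++i)
        w[i] = (base + i < n) ? in[base + i] : 0.0f;

    #pragma unroll
    for (int o = 0; o < OUT_PER_THREAD; ++o)
    {
        int idx = base + o + FILTER_SIZE - 1; // exactly one new load per output
        w[FILTER_SIZE - 1] = (idx < n) ? in[idx] : 0.0f;

        float acc = 0.0f;
        #pragma unroll
        for (int k = 0; k < FILTER_SIZE; ++k)
            acc += w[k] * c_filter[k];
        if (base + o < n)
            out[base + o] = acc;

        #pragma unroll                        // shift the window by one element
        for (int k = 0; k < FILTER_SIZE - 1; ++k)
            w[k] = w[k + 1];
    }
}
[/font][/indent]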

I’m looking forward to hierarchical register files that will increase efficiency even further. I wonder how disruptive a change that will be. From what I’ve read, it should be transparent to the programmer, since the local operand registers are in the same namespace as the global registers, so there is no need for different instructions.

The new CUDA 4.2 Toolkit posted here:

includes the updated Programming Guide (Chapters 4 and 5 and Appendix F), which confirms some things and notes some other unexpected attributes of compute capability 3.0:

  • The execution time of an instruction (for those with maximum throughput, anyway) is 11 clock cycles, down from 22 on Fermi.

  • Throughputs of 32-bit operations are no longer identical. Max throughput per clock on an SMX is 192 floating-point multiply-adds, 168 integer adds, or 136 logical operations; these all had the same throughput in compute capability 2.0.

  • Relative to the throughput of single precision multiply-add, the throughput of integer shifts, integer comparison, and integer multiplication is lower than before.

  • The throughput of the intrinsic special functions, relative to single precision floating-point MAD, is slightly higher than in compute capability 2.0.

  • Max x dimension for a grid has been raised to 2^31 - 1.

  • Max # of blocks (16), warps (64), and threads (2048) per SMX have been raised.

  • Max size of a 3D texture doubled in each dimension.

  • Max # of textures bound to kernel doubled to 256.

  • Max # of surfaces bound to kernel doubled to 16.

  • New option to select 32 kB of shared memory and 32 kB of L1, in addition to the previous 16/48 and 48/16 splits (see the sketch after this list).

  • As mentioned elsewhere, each shared memory bank can deliver two 32-bit words per clock cycle. A two-word stride between threads no longer produces bank conflicts.
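On the new 32/32 split mentioned above: a minimal sketch of selecting it per kernel, assuming the runtime exposes it through the new cudaFuncCachePreferEqual value (that’s what the 4.2 headers appear to call it):

[indent][font=“Courier New”]
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *data)         // hypothetical kernel
{
    data[threadIdx.x] *= 2.0f;
}

int main()
{
    // Ask for the equal 32 kB shared / 32 kB L1 split for this kernel.
    cudaError_t err = cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);
    if (err != cudaSuccess)
        printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));
    return 0;
}
[/font][/indent]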

The PTX manual adds three more lines to the list of new features:

  • Support for sm_30 architectures.

  • SIMD video instructions

  • Warp shuffle

Do you know what warp shuffle means?

Although I haven’t read it too thoroughly yet, the new (4.2, dated March 9, 2012) C Programming Guide states that the maximum number of threads per SMX is 2048; with 65536 registers per SMX, that works out to a minimum of 32 registers per thread at full occupancy (!). Fermi, for comparison, had at most 1536 threads using 32768 registers, which works out to about 21 registers per thread at maximum occupancy, so it looks like we’re actually getting some register relief here. Can someone from NVIDIA tell us what the maximum number of registers per thread is now? I believe it was 128 (or 127) for Tesla and 63 for Fermi; I’m curious to see how Kepler compares.

Those new video instructions will definitely benefit the machine-vision applications I’m using on my robot. The only unfortunate thing is that there are no packed multiply instructions, which I know are used extensively in some detection algorithms. But that’s not surprising, since adding extra multipliers just to handle packed multiplies isn’t worth the cost, and making reconfigurable, multiprecision multipliers isn’t easy (or is it?).
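The one I’d reach for first in vision code is the byte-wise sum of absolute differences. Going by the PTX 3.0 manual, an inline-asm wrapper for sm_30 would look something like this (the wrapper name and operand order are mine):

[indent][font=“Courier New”]
// Sketch: accumulate per-byte |a - b| into acc with one vabsdiff4 instruction.
__device__ unsigned int sad4(unsigned int a, unsigned int b, unsigned int acc)
{
    unsigned int d;
    asm("vabsdiff4.u32.u32.u32.add %0, %1, %2, %3;"
        : "=r"(d) : "r"(a), "r"(b), "r"(acc));
    return d;                                 // acc + sum of 4 byte differences
}
[/font][/indent]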

I’m also wondering why GK104 uses separate single- and double-precision units. I got the impression Fermi used reconfigurable multiply-add units that could handle both single and double precision. That seemed like a win.

The faster atomic operations will also be useful for image processing if they can speed up histograms.
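The obvious beneficiary is the naive per-pixel atomic histogram, which today lives or dies by atomic throughput. A minimal sketch (bins assumed zeroed beforehand):

[indent][font=“Courier New”]
// Sketch: 256-bin histogram, one global atomicAdd per pixel.
__global__ void histogram256(const unsigned char *image, int n,
                             unsigned int *bins)   // bins pre-zeroed by caller
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)                     // grid-stride loop
        atomicAdd(&bins[image[i]], 1u);            // contended on popular values
}
[/font][/indent]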

Great job, NVIDIA.

A win for fast DP compute, yes. But those capabilities sit unused on 95% of all the Fermi cards NVIDIA made, since they’re used for playing games, not doing compute. GK104’s design capitalizes on that market, spending die space on more but smaller FP cores instead of bigger FP/DP cores.

So the interesting, very interesting, very very interesting question is how GK110’s architecture differs from GK104’s. It can’t just be GK104 scaled up, since the DP throughput would not be enough. So it seems reasonable to speculate that the FP cores themselves are different on GK110, likely more like Fermi’s. Pure speculation, of course.

It is still 63 with the latest nvcc. A max register count of 64 draws a warning while 63 works (like Fermi):

[indent]

[font=“Courier New”]> nvcc -arch=sm_30 -Xptxas=-v -maxrregcount=64 test.cu[/font]

[font=“Courier New”]test.cu[/font]

[font=“Courier New”]ptxas warning : Too big maxrregcount value specified 64, will be ignored[/font]

[font=“Courier New”]ptxas info : Compiling entry function ‘_Z6kernelPj’ for ‘sm_30’[/font]

[/indent]

Can anybody please upload just the documentation?
By the way, one more question: how can I obtain the maximum number of blocks per multiprocessor from the CUDA 4.0 [font=“Courier New”]cudaGetDeviceProperties()[/font] function?

I probably can, but I assume I’m not allowed to. You don’t need to install CUDA 4.2 to get the documents, though: you can extract the files using the [font=“Courier New”]--tar[/font] option to the installation script.
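On the cudaGetDeviceProperties() question above: as far as I can tell, cudaDeviceProp has no blocks-per-multiprocessor field in CUDA 4.x, so you have to infer that limit from the compute capability (8 blocks for 1.x and 2.x, 16 for 3.0 per the new guide). What you can query directly looks like this:

[indent][font=“Courier New”]
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);        // query device 0

    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
[/font][/indent]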

I find it quite odd that with CUDA 4.2, old compute capability 1.x devices have increased their integer addition throughput from 8 to 10 ops/cycle - that’s probably just a typo in Table 5-1.

Like allanmac and Uncle Joe, I love the [font=“Courier New”]shfl[/font] instruction. Just a few weeks ago I wrote some code where it would have been really helpful, and I remember thinking it shouldn’t be too difficult to implement in new hardware.