GTC Keynote Thread

max 255 registers per thread!

9x faster atomic operations over Fermi! :)

http://www.nvidia.co…la-servers.html

Only 190 DP GFLOPS in total (95 GFLOPS per GPU in the K10?) … compared to 600+ DP GFLOPS in the M2090 (Fermi) …

Memory bandwidth per GPU is also almost the same (160 GB/s vs. 170 GB/s for Fermi).

Even with these new features, the absolute double-precision performance looks a lot slower than Fermi’s…

Disappointing… :(

Wait for K20 to get double precision performance + more memory bandwidth. K10 is a “mid-range” Tesla.

K10 will have very poor sales.

I think K10 will be excellent for signal and image processing, which is what they are pitching it for.

Nice. For us, it’s sharing data between threads in a warp. Wondering to what degree “Shuffle” can replace our dependency on shared memory. From the white paper (page 11):

[indent]

Shuffle instruction

To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass data through shared memory. […] [/indent]

Does GK104 enable Shuffle, or is this another GK110 feature?

According to table 86 of the latest PTX specification available at

http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf

“shfl requires sm_30 or higher”

GK104 implements the sm_30 architecture and therefore provides the SHFL instruction. The CUDA C Programming Guide found at

http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf

describes, in section B.13, the warp shuffle intrinsics __shfl(), __shfl_up(), __shfl_down(), and __shfl_xor() that expose the instruction at the CUDA C level.

“Shuffle supports arbitrary indexed references - any thread reads from any other thread”

“store and load operations are carried out in a single step”
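For illustration, here is a minimal sketch (my own code, not from the white paper) of a warp-wide sum using __shfl_down() in place of shared memory. It assumes CUDA 5-era intrinsics (no _sync suffix) and a full 32-thread warp:

__device__ int warpReduceSum(int val)
{
    // Each step folds the upper half of the remaining lanes into the
    // lower half: lane i adds the value held by lane i + offset.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;  // after the loop, lane 0 holds the warp-wide sum
}

Since the data moves register-to-register in a single step, this avoids the separate store and load through shared memory that the quoted passage mentions.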

Did anyone get confirmation on the integer performance of GK110? I heard the question was asked at one of the keynotes, but even the head architect wasn’t able to answer it.

Will it, for example, be able to perform one integer multiplication for each of the 2880 cores per clock cycle?

Throughput of Integer Arithmetic Instructions (Operations per Clock Cycle per Multiprocessor) for GK104 and GK110 is as follows:

32-bit integer add and compare: 160

32-bit integer shift: 32 for GK104 and 64 for GK110

32-bit integer multiply, multiply-add, sum of absolute difference: 32

This information can be found in the CUDA C Programming Guide (Section 5.4.1) in the CUDA Toolkit 5 documentation.
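To connect that back to the earlier question: assuming GK110’s 2880 cores are organized as 15 SMX units (15 × 192 = 2880), a throughput of 32 multiplies per clock per multiprocessor works out to 15 × 32 = 480 32-bit integer multiplies per clock for the whole chip, not one per core.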

Thanks, just downloaded :)

I asked Mark Harris a little about this at the end of the “S0338 - New Features in the CUDA Programming Model” talk. For example, one could now easily write a reduction as a recursive post-order kernel fork/join. But he felt it would probably be less efficient than current CUDA implementations; obviously, a binary fork/join wouldn’t yield an efficient mapping onto warps. However, he suggested an implementation using atomics might be pretty good. He couldn’t say exactly, because he doesn’t yet have the new GPU to play with. So it looks like there will be many more ways to write and optimize kernels with this and other new features.
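For the curious, here is a rough sketch (my own, under the assumption that GK110 dynamic parallelism works as described in the CUDA 5 documentation; requires sm_35 and compilation with -rdc=true) of the recursive post-order fork/join reduction described above. Error checking is omitted:

__global__ void recursiveReduce(const int *in, int *out, int n)
{
    if (n == 1) {                      // base case: a single element
        *out = in[0];
        return;
    }
    // Scratch space for the two partial results; device-side
    // cudaMalloc draws from the device heap in the device runtime.
    int *partial;
    cudaMalloc(&partial, 2 * sizeof(int));
    int half = n / 2;
    // Fork: launch one child kernel per half.
    recursiveReduce<<<1, 1>>>(in, &partial[0], half);
    recursiveReduce<<<1, 1>>>(in + half, &partial[1], n - half);
    cudaDeviceSynchronize();           // join: wait for both children
    *out = partial[0] + partial[1];    // post-order combine
    cudaFree(partial);
    // Note: nesting depth is limited (24 levels by default), so this
    // is illustrative only.
}

As the post notes, this binary fork/join with single-thread launches maps poorly onto warps, which is exactly why an atomics-based variant was expected to fare better.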