GTC Keynote Thread

max 255 registers per thread!

9x faster atomic operations over Fermi! :)

http://www.nvidia.co…la-servers.html

Only 190 DP GFLOPS in total (95 GFLOPS per GPU in the K10?) … compared to 600+ DP GFLOPS in the M2090 (Fermi) …

Memory bandwidth per GPU is also almost the same (160 GB/s vs. 170 GB/s for Fermi).

Even with these new features, the absolute double-precision performance looks a lot slower than Fermi’s…

Disappointing… :(

Wait for K20 to get double precision performance + more memory bandwidth. K10 is a “mid-range” Tesla.

K10 will have very poor sales.

I think K10 will be excellent for signal and image processing, which is what they are pitching it for.

Nice. For us, it’s sharing data between threads in a warp. Wondering to what degree “Shuffle” can replace our dependency on shared memory. From the white paper (page 11):

[indent]

Shuffle instruction

To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pass data through shared memory. […] [/indent]

Does GK104 enable Shuffle, or is this another GK110 feature?

According to table 86 of the latest PTX specification available at

http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/ptx_isa_3.0.pdf

“shfl requires sm_30 or higher”

GK104 implements the sm_30 architecture and therefore provides the SHFL instruction. The CUDA C Programming Guide found at

http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf

describes, in section B.13, the warp shuffle intrinsics __shfl(), __shfl_up(), __shfl_down(), and __shfl_xor() that expose the instruction at the CUDA C level.

“Shuffle supports arbitrary indexed references - any thread reads from any other thread”

“store and load operations are carried out in a single step”
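For illustration, here is a minimal sketch (my own code, not from the white paper) of a warp-wide sum using __shfl_down() in place of shared memory. It assumes CUDA 5-era intrinsics (no _sync suffix) and a full 32-thread warp:

__device__ int warpReduceSum(int val)
{
    // Each step folds the upper half of the remaining lanes into the
    // lower half: lane i adds the value held by lane i + offset.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;  // after the loop, lane 0 holds the warp-wide sum
}

Since the data moves register-to-register in a single step, this avoids the separate store and load through shared memory that the quoted passage mentions.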

Did anyone get confirmation on the integer performance of GK110? I heard the question was asked at one of the keynotes, but even the head architect wasn’t able to answer it.

Will it, for example, be able to perform one integer multiplication for each of the 2880 cores per clock cycle?

Throughput of Integer Arithmetic Instructions (Operations per Clock Cycle per Multiprocessor) for GK104 and GK110 is as follows:

32-bit integer add and compare: 160

32-bit integer shift: 32 for GK104 and 64 for GK110

32-bit integer multiply, multiply-add, sum of absolute difference: 32

This information can be found in the CUDA C Programming Guide (Section 5.4.1) in the CUDA Toolkit 5 documentation.
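To connect that back to the earlier question: assuming GK110’s 2880 cores are organized as 15 SMX units (15 × 192 = 2880), a throughput of 32 multiplies per clock per multiprocessor works out to 15 × 32 = 480 32-bit integer multiplies per clock for the whole chip, not one per core.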

Thanks, just downloaded :)

I asked Mark Harris a little about this at the end of the “S0338 - New Features in the CUDA Programming Model” talk. For example, one could now easily write a reduction as a recursive post-order kernel fork/join. But he felt it would probably be less efficient than current CUDA implementations; obviously, a binary fork/join wouldn’t yield an efficient mapping onto warps. However, he suggested an implementation using atomics might be pretty good. He couldn’t say exactly, because he doesn’t yet have the new GPU to play with. So it looks like there will be many more ways to write and optimize kernels with this and other new features.
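For the curious, here is a rough sketch (my own, under the assumption that GK110 dynamic parallelism works as described in the CUDA 5 documentation; requires sm_35 and compilation with -rdc=true) of the recursive post-order fork/join reduction described above. Error checking is omitted:

__global__ void recursiveReduce(const int *in, int *out, int n)
{
    if (n == 1) {                      // base case: a single element
        *out = in[0];
        return;
    }
    // Scratch space for the two partial results; device-side
    // cudaMalloc draws from the device heap in the device runtime.
    int *partial;
    cudaMalloc(&partial, 2 * sizeof(int));
    int half = n / 2;
    // Fork: launch one child kernel per half.
    recursiveReduce<<<1, 1>>>(in, &partial[0], half);
    recursiveReduce<<<1, 1>>>(in + half, &partial[1], n - half);
    cudaDeviceSynchronize();           // join: wait for both children
    *out = partial[0] + partial[1];    // post-order combine
    cudaFree(partial);
    // Note: nesting depth is limited (24 levels by default), so this
    // is illustrative only.
}

As the post notes, this binary fork/join with single-thread launches maps poorly onto warps, which is exactly why an atomics-based variant was expected to fare better.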