192 cuda cores - how they are organized 6x32 or 4x32 + 4x16?

vvolkov · April 28, 2012, 10:40pm

Interesting reading: “Nvidia’s GeForce GTX 680 graphics processor” by Scott Wasson. A citation:

That’s curious as the best multiply-add throughput I can get on GTX 680 corresponds to 5.5 x 32 ALUs per SM, not 6 x 32 as the sheer core count suggests. Also, the best throughput in integer operations corresponds to 32 cores per SM.
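For reference, an "N x 32 ALUs per SM" figure is measured throughput divided by SM count and clock rate, then by the warp width. Below is a minimal sketch of that conversion. The function name and the sample rate are hypothetical (no raw measurement is given in this thread), and the 8 SMs and 1006 MHz base clock are assumed from the published GTX 680 specifications; a card running at boost clock would give a different result.

```python
WARP_WIDTH = 32  # threads per warp


def warp_wide_units_per_sm(ops_per_second, num_sms, clock_hz):
    """Convert a measured instruction rate into equivalent 32-lane units per SM."""
    ops_per_clock_per_sm = ops_per_second / (num_sms * clock_hz)
    return ops_per_clock_per_sm / WARP_WIDTH


# Assumed GTX 680 figures (published specs, not taken from this thread).
NUM_SMS = 8
BASE_CLOCK_HZ = 1006e6

# Hypothetical input, constructed to equal 176 multiply-adds per clock per SM.
example_rate = 176 * NUM_SMS * BASE_CLOCK_HZ

print(warp_wide_units_per_sm(example_rate, NUM_SMS, BASE_CLOCK_HZ))  # 5.5
```

The same function can be applied to any instruction type, which makes it easy to compare a measurement against the per-SM throughput table in the programming guide.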

Comments?

allanmac · April 29, 2012, 12:17am

That integer measurement sounds like a match. The throughput table in the updated programming guide (section 5.4.1) says 32 ops/clock/SM for shift/mul/mad/sad.

allanmac · April 29, 2012, 12:30am

Random idea: have you tried using a different multiple of warps (60 vs. 64, 30 vs. 32)? I’m wondering if the new Kepler scheduler could benefit from a simpler mapping between warps and the 6x32 (4x48?) vector units? Just a thought…

Also, if the compiler is determining instruction scheduling and, implicitly, warp scheduling up front, then perhaps handing the compiler explicit kernel launch bounds (section B.19 of the Guide) might help?

No idea. I will get a 680 right after the GTC. :)

vvolkov · April 29, 2012, 3:05am

If I run only 1 thread block, then the best performance is at multiples of 4 warps. Sounds reasonable given 4 warp schedulers per SM.

But I can’t get good performance unless I run at least 2 warps per scheduler! With 1 warp per scheduler the throughput is almost exactly 2x lower. I have seen a similar story on G80: it couldn’t run fast with 1 warp per SM, or even 1 warp per block. (But 2 warps were just fine.)
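To relate these observations to a launch configuration, the sketch below counts warps per scheduler for a given block size. It assumes warps are spread evenly over the 4 schedulers, which this thread does not confirm; the helper name is made up for illustration.

```python
WARP_SIZE = 32
SCHEDULERS_PER_SM = 4  # figure quoted in the post above


def warps_per_scheduler(threads_per_block, blocks_per_sm=1):
    """Warps per scheduler, ASSUMING an even split across schedulers."""
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE  # round up
    return warps_per_block * blocks_per_sm / SCHEDULERS_PER_SM


for threads in (128, 256, 512):
    print(threads, warps_per_scheduler(threads))
# 128 1.0
# 256 2.0
# 512 4.0
```

Under that assumption, a single 128-thread block per SM gives 1 warp per scheduler, the case reported as about 2x slower, while 256 threads reaches the 2 warps per scheduler that ran well.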

B.19 doesn’t talk about scheduling, only about register usage. Still, might be worth a try…

Don’t you want to wait for the widely rumored “big Kepler”?

Thanks for this pointer! Now I am totally confused. Throughput of 32-bit integer adds is 168 per SM, logical operations are at 136 per SM - these are not even multiples of 16! How does it all work?!?
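The arithmetic behind that remark can be checked directly. This small sketch only uses the two per-SM rates quoted above and makes no claim about how the hardware is organized.

```python
# Ops per clock per SM, as quoted in the post above.
rates = {"int32 add": 168, "logical ops": 136}

for name, rate in rates.items():
    # Equivalent 32-wide units, remainder mod 16, remainder mod 8.
    print(name, rate / 32, rate % 16, rate % 8)
# int32 add 5.25 8 0
# logical ops 4.25 8 0
```

Both figures are multiples of 8 but not of 16, and neither matches the 6 x 32 or 4 x 32 + 4 x 16 layouts from the thread title.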

allanmac · April 29, 2012, 5:16am

There is a 4/5/2012 and a 4/16/2012 Programming Guide (on the developer site). The Win64 and OSX 4.2 installers have the 4/5/2012 copy – at least on my machines.

The simplest reading of the table in the latest guide seems to imply that out of the 192 cores there are 32 that can perform int32 shift/mul/mad/sad each clock and the remaining 160 can do int32 comparisons and logic (assuming all the operations are actually single cycle). Hopefully someone from NVIDIA can provide details at the GTC session on Kepler. :)

The aluminum/magnesium/polycarbonate GTX 690 was announced tonight in Shanghai. I’m not sure I could bear putting that card on the inside of my case given how cool it looks. If the “Big Kepler” is going to top the GTX 690 it better come shrouded in gold!

allanmac · April 29, 2012, 5:20am

(table attached)