New G200 and compute capabilities

I am very pleased with the G200 improvements for CUDA! Doubled registers, voting atomics, more threads in flight, and double support were all among my biggest wishes. And the double ALU allocation is also a good compromise sweet spot… useful in its existence but not stealing too many transistors from the raw single precision engines. From reading the online reviews by the gaming sites, they’re all pleased but not enthused. But for CUDA, I have a big grin… it’s an excellent design evolution. Kudos to the NV engineers!

A few questions about the new architecture:

  1. G200 devices are all Compute Capability 1.3. G92 is all 1.1. Is there any hardware that was only level 1.2? It seems as if we skipped a generation, though maybe this is a follow-up possibility for later devices that may not have the full double support. The practical question: do we ever have to worry about supporting a future device which has level 1.2 but not 1.3?

  2. The Compute Capability listings in A.1.3 and A.1.4 don’t say so, but it seems as if native 64-bit integer support was also added for 1.2 devices. This is alluded to in section 5.2 of the Programming Guide, which talks about native long long using two registers in those devices. But it’d be nice to have this explicitly documented.

  3. Are 32 bit integer multiplies still 16-clock ops, or are they now 4 clock in the new architecture? Previous programming guides strongly implied the overhead would be changing… and if there’s now 64 bit integer support I bet that changed the 32 bit mults too.

  4. Double math is nicely abstracted for us, it Just Works. But we know that there are fewer double ALUs than single ALUs… there’s only one double ALU per SM. So I’m guessing the thread scheduler must keep allocating the double units to threads as needed and keeping other pending threads waiting (similar to memory latency hiding.) The programming docs say nothing at all about how this works. Now since it’s abstracted, we may not need to know much, but perhaps it could affect code design if we knew any restrictions or inefficiencies. For example, it may be that deliberately interleaving float and double ops could help throughput? So my question is if there’s anything we need to know about double use that will help our code use our throughput as efficiently as possible. The programming guide doesn’t tell us anything.

  5. Also in regards to double precision support: We’re told pretty clearly what most float and integer operation speeds are in section (for floats, 4 clocks for +, -, * and mul-add) but nothing whatsoever about double evaluation speed. Has anyone tried to measure this?
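For question 5, one way to get a rough number yourself is a clock() microbenchmark. This is only a sketch under assumptions: `madChain`, `clocksPerIter` and `measureMad` are names made up for this post, the figure it returns includes loop overhead and latency rather than pure ALU throughput, and on this generation the code has to be built for sm_13, otherwise nvcc demotes double to float and both runs measure the same thing.

```cuda
#include <cuda_runtime.h>

// Hypothetical microbenchmark kernel: a chain of dependent multiply-adds,
// timed with the per-multiprocessor clock() counter.
template <typename T>
__global__ void madChain(T* out, unsigned int* cycles, T a, T b, int iters)
{
    T x = static_cast<T>(threadIdx.x);
    clock_t start = clock();
    for (int i = 0; i < iters; ++i) {
        x = x * a + b;                 // each step depends on the previous one
    }
    clock_t stop = clock();
    out[threadIdx.x] = x;              // store x so the loop is not optimized away
    if (threadIdx.x == 0) {
        *cycles = static_cast<unsigned int>(stop - start);
    }
}

// Clocks elapsed per loop iteration, as seen by thread 0.
static double clocksPerIter(unsigned int cycles, int iters)
{
    return static_cast<double>(cycles) / iters;
}

// Launch one block of `threads` threads and return clocks per iteration.
// Returns 0 if the launch or the copy back failed.
template <typename T>
double measureMad(int threads, int iters)
{
    T* dOut = 0;
    unsigned int* dCycles = 0;
    unsigned int hCycles = 0;
    cudaMalloc((void**)&dOut, threads * sizeof(T));
    cudaMalloc((void**)&dCycles, sizeof(unsigned int));
    // a and b are passed as arguments so the compiler cannot fold the chain.
    madChain<T><<<1, threads>>>(dOut, dCycles,
                                static_cast<T>(1.0001), static_cast<T>(0.5), iters);
    cudaMemcpy(&hCycles, dCycles, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    cudaFree(dCycles);
    return clocksPerIter(hCycles, iters);
}
```

Comparing `measureMad<double>(512, 10000)` against `measureMad<float>(512, 10000)` should give a feel for the relative cost. With a single warp the dependent chain mostly measures latency; with many warps in the block it gets closer to what one SM can sustain.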

I think CC 1.2 may just be for something not yet released :)

The best technical summary of G200 has not been from the updated programming guide, nor from discussion here. Surprisingly, it’s from the Beyond3D site. It has a truly excellent technical architecture summary!

Really, this is required reading for all CUDA programmers right now. It answers questions nobody here has been able to answer, including some of the questions I posted above.

Compute Capability 1.2 devices don’t exist now, but it’s likely there will be future low-end parts that do indeed remove the FP64 ALUs and therefore will not support 1.3. So you need to test for it!
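If you want that test spelled out, here is a minimal sketch using the runtime API. `supportsDouble` and `countDoubleCapableDevices` are hypothetical helper names, and the rule "double means 1.3 or higher" is the assumption taken from this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true when the compute capability is at least 1.3,
// the level at which double precision support appears.
static bool supportsDouble(int major, int minor)
{
    return major > 1 || (major == 1 && minor >= 3);
}

// Print each device's compute capability and return how many of them
// can run double precision code. Returns 0 if no device is found.
static int countDoubleCapableDevices()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return 0;
    }
    int capable = 0;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            continue;                  // skip devices we cannot query
        }
        bool ok = supportsDouble(prop.major, prop.minor);
        std::printf("Device %d (%s): compute %d.%d, double %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    ok ? "yes" : "no");
        if (ok) {
            ++capable;
        }
    }
    return capable;
}
```

Call `countDoubleCapableDevices()` once at startup and fall back to a float code path when it returns 0. A hypothetical 1.2 part would land in that fallback.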

G200 now supports 32 bit full speed mults, BUT only using the FP64 core! [Interesting!]

Great post!!!

Cheers SPWorley

Yes, nice post. What are the “voting atomics” though? I must have missed it if it was on that b3d article; the tech report says something about read-modify-write (I’m assuming this is all of the atomicAdd, etc.) into main memory, but I’m not sure how this is different from compute 1.1 (8600, which is what I have), or if it now allows sync between blocks?

More registers are always nice :) but I hope there are improvements for threads becoming more independent; UIUC’s CUDAcasts lecturer seemed to suggest there was a large penalty from the instruction cache and instruction manager when threads diverged (followed different execution paths).

The voting primitives allow you to generate a boolean condition from all of the threads in a warp. __any() is a warp-wide logical OR, and __all() is a warp-wide logical AND. These can be used to select a branch that is guaranteed to be consistent across the warp, i.e. without divergence.
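Here is a small sketch of that idea. The kernel `safeSqrt` and the `WARP_ANY` wrapper are made up for illustration, and it assumes the block size is a multiple of 32 so that every warp is full. The vote functions need compute capability 1.2 or higher, and CUDA 9 and later replaced `__any()`/`__all()` with `__any_sync()`/`__all_sync()`, hence the macro.

```cuda
#include <cuda_runtime.h>

// Warp vote wrapper. __any() is the name used in this thread; CUDA 9 and
// later use the *_sync form, which takes a mask of participating lanes.
#if defined(CUDART_VERSION) && CUDART_VERSION >= 9000
#define WARP_ANY(p) __any_sync(0xffffffffu, (p))
#else
#define WARP_ANY(p) __any(p)
#endif

// Hypothetical kernel: out[i] = sqrt(|in[i]|). The sign fix-up is skipped
// by a whole warp when none of its lanes holds a negative input.
// Assumes blockDim.x is a multiple of 32.
__global__ void safeSqrt(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool inRange = i < n;
    float x = inRange ? in[i] : 0.0f;

    // Every lane reaches the vote, including lanes past the end of the array.
    // The result is identical across the warp, so this branch never diverges.
    if (WARP_ANY(x < 0.0f)) {
        x = fabsf(x);                  // harmless for lanes that were non-negative
    }

    if (inRange) {
        out[i] = sqrtf(x);
    }
}
```

Launched as `safeSqrt<<<blocks, 128>>>(dIn, dOut, n)`, a warp whose 32 inputs are all non-negative skips the fix-up entirely. Whether that pays off depends on how expensive the skipped work is, which is the part I'm still experimenting with.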

That’s as far as I understand them. I’m still experimenting to figure out what algorithms can benefit from warp voting.

As for threads becoming more independent: if that were to change, it would be equivalent to the warp size decreasing, which has not happened.

My suspicion is that when the GT200 is scaled down in the future for mobile GPUs, they might consider dropping the double precision units to save on die space. I would not be surprised to see compute 1.2 devices for laptops.

(Disclaimer: this is speculation. I have no inside knowledge, of course.)