Can someone from NVIDIA clarify if the GTX 460 has 48 cores per multiprocessor? (48 x 7 = 336)
And do those 48 cores still share the same 32K registers and (up to) 48KB of shared memory per multiprocessor as other “Compute Capability 2.0” devices?
If all this is true, am I right that the major change is that three half-warps are scheduled at a time, and therefore the per-SM throughput numbers are 50% higher?
Or is it more complicated than this? (of course it is!)
+1 for the clarification on the architecture.
From the write-up at AnandTech, it sounds like the SM is still dual issue, one half-warp per scheduler per clock cycle, like the GF100. The difference is that the schedulers can do some sort of out-of-order execution on a half-warp if it is permissible, so two independent instructions from a half-warp might be issued and run at the same time if the code permits.
I’m also curious, with the reduction of L2 cache size and increase in instruction throughput per SM, whether this GPU gets a compute capability version bump or not. It’s really on the border, so I suspect not, but so far devices with the same compute capability have only differed in the number of memory channels and number of SMs. (Not counting the special Tesla C20X0 features…)
Thanks, looking forward to an updated CUDA C Programming Guide with the gory details.
There aren’t many! That’s the primary change; I don’t know of any other developer-visible ones.
Side note: I like the NVIDIA marketing spin on the GTX 470/480 and how it is the “tank” of the product line. Very cool artwork for both gamers and CUDAlators.