GTX 460 - how many angels on the head of a pin: how many cores per MP for a GTX 460 - 32 or 48?

Yesterday I traded in two GTX-260s for two Gainward GTX-460s with GDDR5. Based on the purported number of cores (336, per www.nvidia.com) I expected a peak rate of 907 GFlops. If that figure were correct, the measured performance of the GTX-460 would be dismal: just a tad faster than the GTX-260, both with CUDA and with OpenCL.

Using the gflops code posted a long time ago by Simon Green, I get about 515 GFlops. I have found gflops to be a reliable benchmark on the 9800 GTX, GTX-260 and GTX-295.

I had first been running the 258.19 driver from the OpenCL 1.1 beta and now have the official CUDA 3.1 release driver (256.35) running. Much to my surprise, deviceQuery and oclDeviceQuery tell me there are 7 MPs and 224 (!!!) cores.

If I were to assume that 224 = 7 × 32 cores is correct rather than 336 = 7 × 48, the performance I observed so far would make sense. deviceQuery calculates the cores as nGpuArchCoresPerSM[deviceProp.major] * deviceProp.multiProcessorCount.

My compute capability is reported as 2.1. nGpuArchCoresPerSM[2] = 32 as for the GF-100.
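For reference, here is the arithmetic behind the two scenarios, assuming the reference 1350 MHz shader clock (the Gainward boards may be clocked slightly differently):

[codebox]336 cores * 1.35 GHz * 2 flops/cycle (MAD) = 907.2 GFLOPS   (the advertised peak)
224 cores * 1.35 GHz * 2 flops/cycle (MAD) = 604.8 GFLOPS   (7 MPs * 32 cores)[/codebox]

My measured 515 GFlops is about 85% of the 7 × 32 figure, but only about 57% of the advertised peak.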

Could someone from NVIDIA please shed light on this: how many cores does a GTX-460 MP have, and if it is indeed 48, why is the performance so miserable?

This also matches up with the measurement from CUDA-Z (which might even use the same code) in another thread.

The calculation, as you point out, uses a lookup table that assumes the number of CUDA cores is constant for devices with the same compute capability major number. This is no longer true. Since CUDA 3.1 predates the release of compute capability 2.1 devices, the calculation is wrong. Several people have asked for a field to be added to the device properties structure to future-proof our code against such problems. We saw the same problems with compute 2.0 devices using older programs that had hardcoded the number of CUDA cores per MP to be 8.
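In the meantime, a lookup keyed on the full (major, minor) pair works around the problem. Just a sketch, not the actual SDK helper:

[codebox]// Hypothetical replacement for nGpuArchCoresPerSM: map the full
// compute capability (major.minor) to cores per MP instead of
// indexing by major version alone.
int coresPerMP(int major, int minor)
{
    switch (major * 10 + minor) {
        case 10: case 11: case 12: case 13: return 8;   // G80/G9x/GT200
        case 20:                            return 32;  // GF100
        case 21:                            return 48;  // GF104
        default:                            return -1;  // unknown - don't guess
    }
}

// usage: coresPerMP(deviceProp.major, deviceProp.minor) * deviceProp.multiProcessorCount[/codebox]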

The first part - 48 CUDA cores per MP - has been more or less confirmed by tmurray (as well as by a slew of reviewers). I think this shows that the peak performance of the new architecture is more sensitive to instruction scheduling. I’m curious whether this is something that can be improved in the next CUDA toolkit and driver release…

As an aside, this is starting to sound like the long-standing confusion over counting GFLOPS on compute 1.x, which can dual-issue a MUL with a MAD.

Well, I hope not. I always base peak GFlops estimates on 2 flops/cycle (MAD); the nominal 907 is likewise based on 2 flops/cycle. So if the GTX-460 really does have 48 cores per MP, then the performance with existing compilers and drivers is not any better than that of the GTX-260. I can live with that for codes that are not memory-constrained. Measured as GFlops/$, the performance is then about the same for both in my case.

Have you tried compiling the code for sm_21? Interestingly enough, CUDA 3.1 accepts this, although I haven’t tested yet whether it actually produces different code (and if it does, I cannot report results as I lack a suitable device).
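For the CUDA version of the test, that would be something along the lines of (file name made up, obviously):

[codebox]nvcc -arch=sm_21 -o gflops gflops.cu[/codebox]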

Yes I did - I tried both sm_20 and sm_21. No noticeable difference. I do, however, get this from the build log of an OpenCL program:

[codebox]Program : 0x9ca2a0

Build Log ...... 

: Considering profile 'compute_20' for gpu='sm_21' in 'cuModuleLoadDataEx_a'

[/codebox]

Please note the compute_20 and gpu=sm_21. This is a decision made by the NVIDIA OpenCL compiler, not by me.

As already suggested, the interesting question here is whether or not the test has enough ILP to utilize those extra cores. Judging from your results, it seems it doesn’t.

Sure, what I mean is that when the dual-issue MUL was part of the architecture (it was dropped for Fermi, with good reason), the peak GFLOPS often quoted in marketing materials was significantly higher than what many people saw on compute-bound workloads. Except for some carefully crafted code, that extra MUL ability often went unused.

These initial reports make it sound like compute capability 2.1 is back in the same situation again, with the gap between peak marketing GFLOPS and “typical” GFLOPS being rather large due to the inability of programs to fully utilize the third group of 16 CUDA cores in each MP. I hope that driver and compiler updates improve this situation.

Also, I’m starting to see the wisdom of tmurray’s admonishment in another thread to not pay attention to CUDA cores so much. Instead, we should be benchmarking throughput per multiprocessor for each compute capability and scaling that by # of MPs to predict performance of compute-bound workloads. This is more work, given the now 6 different compute capabilities that have been released, although you can group 1.0/1.1 and 1.2/1.3, leaving you with four major platforms. I expect future compute capabilities to tweak and rebalance resources, changing things further. (nvcc already has options for sm_22, sm_23, and sm_30…)
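In other words, the back-of-the-envelope estimate becomes something like:

[codebox]predicted GFLOPS  ≈  measured GFLOPS per MP for that compute capability
                      * number of MPs
                      * (shader clock / shader clock of the benchmarked card)[/codebox]

(The clock-ratio term only matters when the cards aren’t clocked alike.)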

My takeaway message from the GF100 and GF104 releases is that peak GFLOPS and peak global memory bandwidth are gradually becoming worse predictors of kernel performance. Unless you are hand-coding some numerically intense block of assembly, I don’t know anyone who uses peak GFLOPS on a CPU as a predictor of performance on real-life code. GPUs are increasing in complexity to the point where their non-linearity will be comparable to that of a CPU.

I don’t consider this a criticism of CUDA (though I’ll miss the simpler days), but more of a result of the growing flexibility and diversity of CUDA architectures.

I wholeheartedly agree with your comments. By now I have benchmarked a whole slew of programs, both on a laptop with a GT 330 (48 cores) and on the GTX 460. Whenever I divide the time for the GT 330 by the time for the GTX 460, multiply that by 48 (GT 330 cores), and divide by 7 (GTX 460 MPs), I end up with about 32 as the result, as long as the calculation is not memory-bound. My conclusion is that the GTX 460 may well have 48 cores per MP, but is effectively using only 32 of them, with the remaining 16 sitting idle.
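In formula form (neglecting the small difference in shader clock between the two cards):

[codebox]effective cores per GTX 460 MP  ≈  (t_GT330 / t_GTX460) * 48 cores / 7 MPs  ≈  32[/codebox]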

Scaling by number of MPs doesn’t make sense. In that case my laptop GPU (6 MPs, 1.265 GHz) should be almost as fast as my desktop GPUs (7 MPs, 1.35 GHz).

I am quite happy with the additional features the Fermi architecture offers. And as far as temperatures are concerned, my GTX 460s sit at about 38 degrees centigrade when idle, while my GTX 260s never went below about 53 degrees. The GF-104 certainly isn’t a hot chip. A colleague with a GTX-470 told me that he easily reaches about 90 degrees centigrade.

For all the GPUs I had tested previously, gflops scaled extremely well, reaching about 97-99% of nominal peak. On the GTX 460, gflops reaches about 85% of peak if I assume 32 cores per MP, but only about 55% if I assume 48.

That’s not what I mean. Your laptop and desktop GPUs are almost certainly not the same compute capability. Basically, scaling performance purely based on CUDA cores across compute capabilities does not work anymore. Instead, you have to benchmark on each compute capability and then scale to other devices within that capability by # of MPs. (Assuming compute-bound workloads, etc., etc.)

OK, agreed - the GT 330 is 1.2, the GTX 260 is 1.3 - and I can scale “reasonably” well between them - the GTX 460 is 2.1. Is there any way to measure how many cores are actually used? There should be a way to check that - something like the microbenchmarking done for the G200. And I don’t mean an occupancy-calculator kind of tool, but some real benchmark code. Given that Fermi has not been around for long, it is obviously difficult to predict performance based on past experience.
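Something along these lines is what I have in mind - a sketch only, names made up, not yet validated on the 460: a kernel that runs a long chain of dependent MADs per thread, launched with plenty of warps per MP, so the measured rate shows how many cores the schedulers actually keep busy.

[codebox]#include <stdio.h>

#define N_ITER 4096

// Each thread executes a long chain of dependent multiply-adds.
__global__ void madChain(float *out, float a, float b)
{
    float x = threadIdx.x * 0.001f;
    #pragma unroll 16
    for (int i = 0; i < N_ITER; ++i)
        x = x * a + b;                                // one dependent MAD = 2 flops
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;   // prevent dead-code elimination
}

int main(void)
{
    const int blocks = 7 * 8, threads = 256;          // plenty of warps per MP
    float *d_out;
    cudaMalloc((void**)&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    madChain<<<blocks, threads>>>(d_out, 0.999f, 0.001f);   // warm-up
    cudaEventRecord(start, 0);
    madChain<<<blocks, threads>>>(d_out, 0.999f, 0.001f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * N_ITER * blocks * threads;
    printf("%.1f GFLOPS\n", flops / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}[/codebox]

Dividing the measured GFLOPS by (2 × shader clock × number of MPs) then gives the effective number of cores per MP.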

I think he meant that you would look at the theoretical or tested throughput of each MP and go from there.

Ex:

1 laptop GPU MP => 40 GFLOPS

1 desktop GPU MP => 60 GFLOPS

So if the desktop GPU has 7 MP’s and the laptop version 6 MP’s you would kind of know what to expect…

EDIT: ok, seibert beat me to it :)

One of the reasons a 460 produces less heat might be that a third of the cores just sits idle? [/sarcasm]

I really wonder what holds them back from being used. Alternatively, I wonder why Nvidia went superscalar and didn’t just feed them from a third independent warp.

@tera

I guess it’s a design / yield issue? Adding another scheduler along with the extra memory resources needed might have produced more broken SM’s.

With this design they increase the number of SP’s without being forced to increase the on-chip memory resources accordingly.

Yeah, I suspect it is a two-fold win: no need for a third warp scheduler, and also no additional pressure to increase the maximum number of active warps along with the size of the register file. An SM with 3 schedulers and 3 sets of 16 CUDA cores would probably need even more active warps to keep all the pipelines full. By spending some die area to make the two schedulers superscalar, you save elsewhere. (And you have to trust that your compiler writers can generate code that exposes as much ILP as possible.)
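For instance, comparing a purely dependent MAD chain against a version with several independent chains per thread should show how much the superscalar issue can actually recover. Again just a sketch, building on the madChain kernel sketched earlier in the thread (so N_ITER and the launch parameters come from there):

[codebox]// Variant of the earlier madChain sketch with four independent
// dependency chains per thread, giving the two schedulers more ILP
// to co-issue. If the third group of 16 cores can be used at all,
// this version should land noticeably closer to the 48-core peak.
__global__ void madChainILP4(float *out, float a, float b)
{
    float x0 = threadIdx.x * 0.001f, x1 = x0 + 0.1f;
    float x2 = x0 + 0.2f,            x3 = x0 + 0.3f;
    #pragma unroll 16
    for (int i = 0; i < N_ITER / 4; ++i) {
        x0 = x0 * a + b;   // four independent MADs per iteration
        x1 = x1 * a + b;
        x2 = x2 * a + b;
        x3 = x3 * a + b;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x0 + x1 + x2 + x3;
}[/codebox]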