GTX 460 - how many angels on the head of a pin: how many cores per MP for a GTX 460 - 32 or 48?

Yesterday I traded in two GTX-260s for two Gainward GTX-460s with GDDR5. Based on the purported number of cores (336, per www.nvidia.com) I expected a peak rate of 907 GFlops. If that figure were correct, the measured performance of the GTX-460 would be dismal: just a tad faster than the GTX-260, both with CUDA and with OpenCL.

Using the gflops code posted a long time ago by Simon Green, I get about 515 GFlops. I have found gflops to be a reliable benchmark on the 9800 GTX, GTX-260 and GTX-295.

I had first been running the 258.19 driver from the OpenCL 1.1 beta and now have the official CUDA 3.1 release driver (256.35) running. Much to my surprise, deviceQuery and oclDeviceQuery tell me there are 7 MPs and 224 (!!!) cores.

If I were to assume that 224 = 7 × 32 cores is correct rather than 336 = 7 × 48, the performance I observed so far would make sense. deviceQuery calculates the cores as nGpuArchCoresPerSM[deviceProp.major] * deviceProp.multiProcessorCount.

My compute capability is reported as 2.1. nGpuArchCoresPerSM[2] = 32 as for the GF-100.
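For reference, here is the arithmetic behind the two scenarios, assuming the reference 1350 MHz shader clock (the Gainward boards may be clocked slightly differently):

[codebox]336 cores * 1.35 GHz * 2 flops/cycle (MAD) = 907.2 GFLOPS   (the advertised peak)
224 cores * 1.35 GHz * 2 flops/cycle (MAD) = 604.8 GFLOPS   (7 MPs * 32 cores)[/codebox]

My measured 515 GFlops is about 85% of the 7 × 32 figure, but only about 57% of the advertised peak.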

Could someone from NVIDIA please shed light on this: how many cores does a GTX-460 MP have, and if it is indeed 48, why is the performance so miserable?

This also matches up with the measurement from CUDA-Z (which might even use the same code) in another thread.

The calculation, as you point out, uses a lookup table that assumes the number of CUDA cores is constant for devices with the same compute capability major number. This is no longer true. Since CUDA 3.1 predates the release of compute capability 2.1 devices, the calculation is wrong. Several people have asked for a field to be added to the device properties structure to future-proof our code against such problems. We saw the same problems with compute 2.0 devices using older programs that had hardcoded the number of CUDA cores per MP to be 8.
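In the meantime, a lookup keyed on the full (major, minor) pair works around the problem. Just a sketch, not the actual SDK helper:

[codebox]// Hypothetical replacement for nGpuArchCoresPerSM: map the full
// compute capability (major.minor) to cores per MP instead of
// indexing by major version alone.
int coresPerMP(int major, int minor)
{
    switch (major * 10 + minor) {
        case 10: case 11: case 12: case 13: return 8;   // G80/G9x/GT200
        case 20:                            return 32;  // GF100
        case 21:                            return 48;  // GF104
        default:                            return -1;  // unknown - don't guess
    }
}

// usage: coresPerMP(deviceProp.major, deviceProp.minor) * deviceProp.multiProcessorCount[/codebox]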

The first part - 48 CUDA cores per MP - has been more or less confirmed by tmurray (as well as by a slew of reviewers). I think this shows that the peak performance of the new architecture is more sensitive to instruction scheduling. I’m curious whether this is something that can be improved in the next CUDA toolkit and driver release…

As an aside, this is starting to sound like the long-standing confusion over counting GFLOPS on compute 1.x, which can dual-issue a MUL with a MAD.

Well, I hope not. I always base peak GFlops estimates on 2 flops/cycle (MAD); the nominal 907 is likewise based on 2 flops/cycle. So if the GTX-460 really does have 48 cores per MP, then the performance with existing compilers and drivers is not any better than that of the GTX-260. I can live with that for codes that are not memory-constrained. Measured as GFlops/$, the performance is then about the same for both in my case.

Have you tried compiling the code for sm_21? Interestingly enough, CUDA 3.1 accepts this, although I haven’t tested yet whether it actually produces different code (and if it does, I cannot report results as I lack a suitable device).
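For the CUDA version of the test, that would be something along the lines of (file name made up, obviously):

[codebox]nvcc -arch=sm_21 -o gflops gflops.cu[/codebox]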

Yes I did - I tried both sm_20 and sm_21. No noticeable difference. I do, however, get this from the build log of an OpenCL program:

[codebox]Program : 0x9ca2a0

Build Log ...... 

: Considering profile 'compute_20' for gpu='sm_21' in 'cuModuleLoadDataEx_a'

[/codebox]

Please note the compute_20 and gpu=sm_21. This is a decision made by the NVIDIA OpenCL compiler, not by me.

As already suggested, the interesting question here is whether or not the test has enough ILP to utilize those extra cores. Judging from your results, it seems it doesn’t.

Sure, what I mean is that when the dual-issue MUL was part of the architecture (it was dropped for Fermi, with good reason), the peak GFLOPS often quoted in marketing materials was significantly higher than what many people saw on compute-bound workloads. Except for some carefully crafted code, that extra MUL ability often went unused.

These initial reports make it sound like compute capability 2.1 is back in the same situation again, with the gap between peak marketing GFLOPS and “typical” GFLOPS being rather large due to the inability of programs to fully utilize the third group of 16 CUDA cores in each MP. I hope that driver and compiler updates improve this situation.

Also, I’m starting to see the wisdom of tmurray’s admonishment in another thread to not pay attention to CUDA cores so much. Instead, we should be benchmarking throughput per multiprocessor for each compute capability and scaling that by # of MPs to predict performance of compute-bound workloads. This is more work, given the now 6 different compute capabilities that have been released, although you can group 1.0/1.1 and 1.2/1.3, leaving you with four major platforms. I expect future compute capabilities to tweak and rebalance resources, changing things further. (nvcc already has options for sm_22, sm_23, and sm_30…)
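In other words, the back-of-the-envelope estimate becomes something like:

[codebox]predicted GFLOPS  ≈  measured GFLOPS per MP for that compute capability
                      * number of MPs
                      * (shader clock / shader clock of the benchmarked card)[/codebox]

(The clock-ratio term only matters when the cards aren’t clocked alike.)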

My takeaway message from the GF100 and GF104 releases is that peak GFLOPS and peak global memory bandwidth are gradually becoming worse predictors of kernel performance. Unless you are hand-coding some numerically intense block of assembly, I don’t know anyone who uses peak GFLOPS on a CPU as a predictor of performance on real-life code. GPUs are increasing in complexity to the point where their non-linearity will be comparable to that of a CPU.

I don’t consider this a criticism of CUDA (though I’ll miss the simpler days), but more of a result of the growing flexibility and diversity of CUDA architectures.

I wholeheartedly agree with your comments. By now I have benchmarked a whole slew of programs, both on a laptop with a GT 330 (48 cores) and on the GTX 460. Whenever I divide the time for the GT 330 by the time for the GTX 460, multiply that by 48 (GT 330 cores), and divide by 7 (GTX 460 MPs), I end up with about 32 as the result, as long as the calculation is not memory-bound. My conclusion is that the GTX 460 may well have 48 cores per MP, but is effectively using only 32 of them, with the remaining 16 sitting idle.
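In formula form (neglecting the small difference in shader clock between the two cards):

[codebox]effective cores per GTX 460 MP  ≈  (t_GT330 / t_GTX460) * 48 cores / 7 MPs  ≈  32[/codebox]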

Scaling by number of MPs doesn’t make sense. In that case my laptop GPU (6 MPs, 1.265 GHz) should be almost as fast as my desktop GPUs (7 MPs, 1.35 GHz).

I am quite happy with the additional features the Fermi architecture offers. And as far as temperatures are concerned, my GTX 460s sit at about 38 degrees centigrade when idle, while my GTX 260s never went below about 53 degrees. The GF-104 certainly isn’t a hot chip. A colleague with a GTX-470 told me that he easily reaches about 90 degrees centigrade.

For all the GPUs I had tested previously, gflops scaled extremely well, reaching about 97-99% of nominal peak. On the GTX 460, gflops reaches about 85% of peak if I assume 32 cores per MP, but only about 55% if I assume 48.

That’s not what I mean. Your laptop and desktop GPUs are almost certainly not the same compute capability. Basically, scaling performance purely based on CUDA cores across compute capabilities does not work anymore. Instead, you have to benchmark on each compute capability and then scale to other devices within that capability by # of MPs. (Assuming compute-bound workloads, etc., etc.)

OK, agreed - the GT 330 is 1.2, the GTX 260 is 1.3 - and I can scale “reasonably” well between them - the GTX 460 is 2.1. Is there any way to measure how many cores are actually used? There should be a way to check that - something like the microbenchmarking done for the G200. And I don’t mean an occupancy-calculator kind of tool, but some real benchmark code. Given that Fermi has not been around for long, it is obviously difficult to predict performance based on past experience.
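Something along these lines is what I have in mind - a sketch only, names made up, not yet validated on the 460: a kernel that runs a long chain of dependent MADs per thread, launched with plenty of warps per MP, so the measured rate shows how many cores the schedulers actually keep busy.

[codebox]#include <stdio.h>

#define N_ITER 4096

// Each thread executes a long chain of dependent multiply-adds.
__global__ void madChain(float *out, float a, float b)
{
    float x = threadIdx.x * 0.001f;
    #pragma unroll 16
    for (int i = 0; i < N_ITER; ++i)
        x = x * a + b;                                // one dependent MAD = 2 flops
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;   // prevent dead-code elimination
}

int main(void)
{
    const int blocks = 7 * 8, threads = 256;          // plenty of warps per MP
    float *d_out;
    cudaMalloc((void**)&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    madChain<<<blocks, threads>>>(d_out, 0.999f, 0.001f);   // warm-up
    cudaEventRecord(start, 0);
    madChain<<<blocks, threads>>>(d_out, 0.999f, 0.001f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * N_ITER * blocks * threads;
    printf("%.1f GFLOPS\n", flops / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}[/codebox]

Dividing the measured GFLOPS by (2 × shader clock × number of MPs) then gives the effective number of cores per MP.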

I think he meant that you would look at the theoretical or tested throughput of each MP and go from there.

Ex:

1 laptop GPU MP => 40 GFLOPS

1 desktop GPU MP => 60 GFLOPS

So if the desktop GPU has 7 MP’s and the laptop version 6 MP’s you would kind of know what to expect…

EDIT: ok, seibert beat me to it :)

One of the reasons a 460 produces less heat might be that a third of the cores just sits idle? [/sarcasm]

I really wonder what holds them back from being used. Alternatively, I wonder why Nvidia went superscalar and didn’t just feed them from a third independent warp.

@tera

I guess it’s a design / yield issue? Adding another scheduler along with the extra memory resources needed might have produced more broken SM’s.

With this design they increase the number of SP’s without being forced to increase the on-chip memory resources accordingly.

Yeah, I suspect it is a two-fold win: no need for a third warp scheduler, and also no additional pressure to increase the maximum number of active warps along with the size of the register file. An SM with 3 schedulers and 3 sets of 16 CUDA cores would probably need even more active warps to keep all the pipelines full. By spending some die area to make the two schedulers superscalar, you save elsewhere. (And you have to trust that your compiler writers can generate code that exposes as much ILP as possible.)
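For instance, comparing a purely dependent MAD chain against a version with several independent chains per thread should show how much the superscalar issue can actually recover. Again just a sketch, building on the madChain kernel sketched earlier in the thread (so N_ITER and the launch parameters come from there):

[codebox]// Variant of the earlier madChain sketch with four independent
// dependency chains per thread, giving the two schedulers more ILP
// to co-issue. If the third group of 16 cores can be used at all,
// this version should land noticeably closer to the 48-core peak.
__global__ void madChainILP4(float *out, float a, float b)
{
    float x0 = threadIdx.x * 0.001f, x1 = x0 + 0.1f;
    float x2 = x0 + 0.2f,            x3 = x0 + 0.3f;
    #pragma unroll 16
    for (int i = 0; i < N_ITER / 4; ++i) {
        x0 = x0 * a + b;   // four independent MADs per iteration
        x1 = x1 * a + b;
        x2 = x2 * a + b;
        x3 = x3 * a + b;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x0 + x1 + x2 + x3;
}[/codebox]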