GTX 460

If it performs so badly computationally, how come it performs almost at the same level as the 465 graphically? Does that mean graphics processing is optimized to utilize the 48 cores most of the time?

Games usually aren't computationally bound, so SP/ALU count and clock rate aren't that important (AFAIK – I'm not so familiar with modern gaming). You can also compare NVIDIA and ATI GPUs: the ATI ones have significantly higher peak throughput, but it's irrelevant for most games – fps differs by a few percent while peak FLOPS differ by multiples.

I assume that the device code used in 3D rendering is compiled down to something like the cubin in CUDA. The developers of that code probably got advance warning of the architectural changes and had time to reorganize their code to make better use of the extra 16 cores.

The performance of the 460 in CUDA is quite unique due to the new scheduler… that’s why it’s interesting and confusing to determine just how well it’s performing.

It really is time for some good standardized CUDA benchmarks to be assembled. Nontrivial apps that show real problems, not corner cases or raw fundamentals. (Those are useful and interesting by themselves, but they don't give a full feeling of performance.)

I bet that the GPU Gems 4 examples might be a good set of tools to build a multidimensional benchmark around. The book isn't out yet, but the code is (see the bottom of this page). There are also some SDK examples, especially nbody, which would work well as elements of a benchmark.

Combine the suite of tools with a correctness check (most SDK examples have a "Gold" validation), and the collection could serve not only as a benchmark but also as a hardware check, or even an overclocking stability tool.
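To make the validation part concrete, something along these lines would do as the pass/fail gate for each benchmark kernel. This is only a minimal sketch: compareGold and its tolerance values are made up for illustration, not taken from any SDK sample.

// Minimal "gold"-style validation: compare a GPU result against a CPU
// reference with relative/absolute tolerances. compareGold and the tolerance
// values are illustrative, not taken from the SDK.
#include <cmath>
#include <cstdio>

bool compareGold(const float* gpu, const float* gold, int n,
                 float relTol = 1e-5f, float absTol = 1e-7f)
{
    int failures = 0;
    for (int i = 0; i < n; ++i) {
        float diff = std::fabs(gpu[i] - gold[i]);
        if (diff > absTol && diff > relTol * std::fabs(gold[i])) {
            if (failures < 10)          // report only the first few mismatches
                std::printf("mismatch at %d: gpu=%g gold=%g\n", i, gpu[i], gold[i]);
            ++failures;
        }
    }
    std::printf("%s (%d mismatches out of %d)\n",
                failures == 0 ? "PASSED" : "FAILED", failures, n);
    return failures == 0;
}

int main()
{
    const int n = 1 << 20;
    float* gold = new float[n];
    float* gpu  = new float[n];
    for (int i = 0; i < n; ++i) { gold[i] = std::sin(0.001f * i); gpu[i] = gold[i]; }
    bool ok = compareGold(gpu, gold, n);   // in a real run, gpu[] would come from cudaMemcpy
    delete[] gold;
    delete[] gpu;
    return ok ? 0 : 1;
}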

My impression from the discussion in this thread and another one that I started (32 vs 48 cores) is that the present driver/compiler combo does not handle the third group of 16 cores in a GTX 460 well. I suspect that the additional 16 cores do not get utilized at all. I have run more than a dozen benchmarks, none of them memory constrained, and, calculating cores from performance versus a GT 330M, I always end up with 28 to 35 cores per MP (ergo 32 cores). It would be great if NVIDIA people could comment on that.

I have been in the Fermi camp for only two days, so I have not had time to play around with options like using float4s as mentioned elsewhere, but either NVIDIA needs to do some work to get the extra 16 cores utilized without tweaking codes, or one has to rethink how to rewrite codes to activate those 16 cores. For now I am going to consider the GF104 to have 32 cores per MP until proven wrong by benchmark results.
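For reference, the arithmetic behind that kind of estimate looks like this (48 is the GT 330M's published core count; the shader clocks f and runtimes t are left symbolic since they vary per card, so treat this as a sketch of the reasoning rather than my exact numbers):

\[
N_{460} \;\approx\; 48 \cdot \frac{f_{330\text{M}}\, t_{330\text{M}}}{f_{460}\, t_{460}},
\qquad
\text{cores per MP} \;\approx\; \frac{N_{460}}{7}
\]

assuming a compute-bound benchmark whose throughput scales with core count times clock rate.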

Could you post your benchmark results? This is exactly what would help us all analyze the 460.

Yeah if you’re going to compare it to something, compare against a GF100.

Some possibilities are:

  1. Compiler needs to be updated and existing CUDA apps recompiled against it.

  2. Code needs to be restructured to optimise for the new architecture, i.e. the compiler can't hide this.

  3. Nvidia cripples CUDA and OpenCL on the GTX 460 to use only the first two SP banks.
    The fact that they have made it compute capability 2.1 suggests this isn't the case.

  4. Number 3, but only until Tesla products ship with compute capability 2.1.

My money is on #2

I have a point of reference for running my OpenCL N-queen solver for 17x17:

A factory-overclocked GTX 460 (core/shader: 715 MHz / 1430 MHz): 1.64 seconds
GTX 285 (stock clocks): 2.31 seconds
GTX 480 (tested by AnandTech): 1.12 seconds

All three have similar shader clocks (around 1.4 GHz), so it's interesting to compare per-SP performance:

460 has 336 SP @ 1.430GHz, so it’s 787.9872 SP-Gclks
285 has 240 SP @ 1.476GHz, so it’s 818.2944 SP-Gclks
480 has 480 SP @ 1.401GHz, so it’s 753.1776 SP-Gclks
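To spell the metric out: SP-Gclks is just SP count times shader clock (in GHz) times runtime (in seconds), i.e. the aggregate number of SP cycles the run consumed. For the 460, for example:

\[
336~\text{SPs} \times 1.430~\text{GHz} \times 1.64~\text{s} \approx 788~\text{SP-Gclks}
\]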

The lower the number, the better (fewer SP-Gclks means fewer aggregate SP cycles were needed to complete the same task). The program is almost pure computation and extremely light on global memory, but it uses shared memory quite a lot for arrays.
From these numbers it's clear that the GTX 460 uses more than 32 SPs per SM in OpenCL, but the efficiency is not as good as the GTX 480's.

Steve - I totally agree that standardized CUDA benchmarks should be developed. However, could you please explain why you think we need a real application?

Why not make the profiler (or any other tool released by NVIDIA) simply show everything you need to know about your kernel:

  1. GMEM bandwidth utilization (see the hand-rolled sketch right after this list).

  2. Compute related performance numbers.

  3. CUDA core utilization, showing information about threads, blocks, warps and other schedule-related stuff.

  4. Instruction utilization

  5. Shared memory utilization

  6. PCI overhead
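For item 1, something like this can already be hand-rolled today with events, which is exactly the kind of boilerplate a proper tool should make unnecessary. A throwaway sketch: copyKernel and the byte count are placeholders to be replaced with your real kernel.

// Hand-rolled version of item 1: time a kernel with CUDA events and divide
// the bytes it moves by the elapsed time. copyKernel is a stand-in; replace
// it (and the byte count) with your real kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                  // 4 bytes read + 4 bytes written per element
}

int main()
{
    const int n = 1 << 23;                      // 8M floats, 32 MB per buffer
    const size_t bytes = n * sizeof(float);
    float *dIn, *dOut;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemset(dIn, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    cudaEventRecord(start, 0);
    copyKernel<<<blocks, threads>>>(dIn, dOut, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double moved = 2.0 * bytes;                 // one read + one write per element
    printf("%.3f ms, effective bandwidth %.1f GB/s\n", ms, moved / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}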

I think those should be absolute numbers (oh, and give us CORRECT numbers for the peak/sustained values per card - i.e. NOT marketing - so we can see how well or poorly our kernels do). Currently I don't really know if I utilize the GPU enough, how to improve it, or where the bottlenecks are…

I know I can try to remove code, play around with loop unrolling, change smem usage to gmem and see how it affects performance - but why???

Furthermore, what do I care how a sophisticated nBody application performs on an S2050??? I care how my application performs and what I can do to make it faster…

my 1 cent :)

eyal

We need them in addition to all the low level benchmarks we can do.

The reason I stress real apps is that raw throughput tests can be misinterpreted, especially in nearly all naive attempts to assign a GPU a single number like gigaflops.

Real apps also avoid wacko weird corner "benchmarks" that may show special cases that don't happen in real programs, like some unrolled loop that sees how fast you can add 1.0 to a million registers.

But of course you’re right that we’d love some hardware profiler or even static PTX analyzer/simulator that can identify each of the relevant utilizations and how close they come to peak. Not just the list you started but all the other inefficiencies like register stall delays, atomic serialization counts, bank conflicts… etc.

Back to the main topic, I'm still really unsure how my raytracing code will run on a 460 without just buying one and trying… it's no longer as simple as looking at MHz, SP count, and memory bandwidth. This isn't a bad thing, BTW; I'm all for architecture efficiency boosts, but it does make predictions harder. And if GF104 is sensitive to code reorganization or even compiler switches, we need to start worrying about optimizing for yet another target GPU with different behavior.

I just got a GTX 460 for CUDA operations and measured a few early results. It looks like BBCode tables aren’t supported, so apologies for the formatting.

Some of these are consumer cards so I’m including the clock rate. The GTX 260 and 460 used runtime 3.1 and driver 256.40, the others used runtime 3.0 and driver 256.35.

9800GX2      G92      CC 1.1  clock 1.5 GHz   Device BW 52653 MB/s  16 SM, 128 cores
FX5800       GT200GL  CC 1.3  clock 1.3 GHz   Device BW 70830 MB/s  30 SM, 240 cores
GTX 260/216  GT200b   CC 1.3  clock 1.35 GHz  Device BW 97333 MB/s  27 SM, 216 cores
GTX 460      GF104    CC 2.1  clock 1.45 GHz  Device BW 61880 MB/s   7 SM, 336 cores
GTX 470      GF100    CC 2.0  clock 1.22 GHz  Device BW 92450 MB/s  14 SM, 448 cores

The device memory bandwidth as reported by the 3.1 SDK bandwidthTest is interesting. In theory the bandwidth should be on par with the GTX 260 core 216, but instead it’s about two thirds. deviceQuery from 3.1 reports the GTX 460 as 224 cores, as mentioned elsewhere.
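For reference, the usual peak-bandwidth arithmetic, assuming the 1 GB / 256-bit variant at its reference 3.6 GT/s effective memory rate (factory cards vary):

\[
\text{BW}_{\text{peak}} \approx 3.6~\text{GT/s} \times 32~\text{B} = 115.2~\text{GB/s},
\qquad
\frac{61.9}{115.2} \approx 54\%
\]

So the measured 61.9 GB/s is only a bit over half of peak, whereas the GTX 260/216 (reference peak roughly 112 GB/s) measures closer to 90% of its.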

I debated what code to run and ended up using CFD code simulating natural convection in a 3D cavity. The code isn't tuned specifically for any one platform, though I give the 2.0 cards a larger thread block size. I could run other benchmarks if someone has a good idea of something to run. I figured this was real code and worked out better than toy benchmarks with 5-line kernels, though it's definitely not a controlled experiment.

None of the kernels do complicated math functions – it's all operations on a radius-1 stencil of 3D data with adds, subtracts, and multiplies. The way the 3D domain is processed by each kernel is not too dissimilar to that described in Micikevicius's 2009 paper on 3D finite difference computations. Timing was done with Events wrapped around the kernel calls. I'm counting operations as FP ops in the C code – I did not look deeper to see what optimizations were done.
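To give a flavour of the access pattern, here is a stripped-down radius-1 (7-point) stencil in the same spirit. It is not my actual kernel, just an illustrative sketch; the coefficients and launch shape are arbitrary.

// Illustrative radius-1 (7-point) 3D stencil, one (i,j) column per thread,
// marching along k as in Micikevicius' finite-difference scheme.
// Launch e.g.: stencil7<<<dim3((nx+15)/16,(ny+15)/16), dim3(16,16)>>>(in, out, nx, ny, nz, c0, c1);
__global__ void stencil7(const float* in, float* out,
                         int nx, int ny, int nz, float c0, float c1)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1) return;

    for (int k = 1; k < nz - 1; ++k) {
        int idx = (k * ny + j) * nx + i;                         // x fastest, then y, then z
        out[idx] = c0 * in[idx]
                 + c1 * (in[idx - 1]       + in[idx + 1]         // x neighbours
                       + in[idx - nx]      + in[idx + nx]        // y neighbours
                       + in[idx - nx * ny] + in[idx + nx * ny]); // z neighbours
    }
}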

Kernel A does 39 FP operations per cell and uses 48 (CC 2.0) / 33 (CC 1.3) registers. It isn’t using shared memory.

Kernel B does 192 FP operations per cell and uses 63 (CC 2.0) / 53 (CC 1.3) registers. It’s quite dense, and does not use shared memory.

Kernel C is very simple, with 11 FP operations per cell and uses 21 (CC 2.0) / 15 (CC 1.3) registers. I have measured a global and a shared memory version shown as Cg and Cs.

The 9800GX2 ran a smaller domain size since it has less memory. The others ran the same size with about 800MB of global memory used. The problem size was halved for double precision so it used about the same device memory.

Single Precision, numbers in GFLOPS

9800GX2   A:  7.6   B:  13.7   Cg:  5.7   Cs: 17.0
FX5800    A: 26.0   B:  67.0   Cg: 21.4   Cs: 40.0
GTX 260   A: 27.7   B:  76.7   Cg: 23.4   Cs: 48.7
GTX 460   A: 37.8   B:  98.5   Cg: 46.2   Cs: 36.0
GTX 470   A: 51.5   B: 154.3   Cg: 42.3   Cs: 48.5

This looks pretty good, though the 470 is definitely faster. While shared memory is a huge win on the pre-2.0 cards, it has been much less of one on the 470 for us (not a big surprise). Strangely, on the 460 the shared memory version of the same kernel comes out much slower than the simple global memory version. It could be an issue that tuning would resolve.

Double Precision, numbers in GFLOPS

FX5800    A: 16.1   B:  27.8   Cg: 12.8   Cs: 21.2
GTX 260   A: 18.3   B:  38.3   Cg: 16.2   Cs: 21.5
GTX 460   A: 20.0   B:  26.5   Cg: 21.4   Cs: 21.2
GTX 470   A: 25.2   B:  30.6   Cg: 20.1   Cs: 31.8

Kernel B sees a big hit running in double precision, while the rest are hit much less so – they’re presumably much more bandwidth limited.

Again, this is just a quick look at how some of my kernels performed with the GTX 460. I thought I'd share since a number of people have asked for some CUDA results.

Is the GTX 260 faster in double precision?

It would take a lot more testing, tuning parameters per GPU, and looking at the profiler to make a more general conclusion. For my dense kernel with these parameters, the GTX 260 core 216 came out faster than anything else, and I’ve repeated the test a few times with the same result. I tried varying the thread block size for the GTX 460 but the other values I used were somewhat slower.

Which version did you get, 1024MB or 768MB?

I usually end up using these: https://www.cs.virginia.edu/~skadron/wiki/r…x.php/Main_Page

I got the MSI Cyclone 1GB OC. I’d have seriously considered a 2GB version if one was available. It’s in a headless i7 system running Fedora 12.

The Rodinia suite looks interesting. I might try that this weekend.

I found this interesting thread.

http://forum.beyond3d.com/showthread.php?t=58077

Posted by pcchen

Aha I found one:

e = b * e + 2.0f;
e2 = b2 * b2 + e2;
b = e * b + 2.0f;
b2 = e2 * e2 + b2;

This is able to do 95.573 FLOP/cycle/MP. Further modification on e2 and b2 slows it down though.
So it still looks like a register bandwidth/allocation thing.
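For anyone who wants to reproduce that kind of number, a harness along these lines should do it. This is only a sketch, not pcchen's actual code; the block/thread counts are arbitrary and the accounting assumes 2 FLOP per FMA.

// Sketch of a FLOP-throughput micro-benchmark around the quoted instruction
// mix (not pcchen's actual harness). Results are written out so the compiler
// cannot drop the arithmetic; the values overflow to inf quickly, which does
// not affect instruction throughput.
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS 65536

__global__ void fmaMix(float* out, float seed)
{
    float e  = seed + threadIdx.x;
    float e2 = seed * 0.50f  + threadIdx.x;
    float b  = seed * 0.25f  + blockIdx.x;
    float b2 = seed * 0.125f + blockIdx.x;
    #pragma unroll 16
    for (int i = 0; i < ITERS; ++i) {
        e  = b  * e  + 2.0f;
        e2 = b2 * b2 + e2;
        b  = e  * b  + 2.0f;
        b2 = e2 * e2 + b2;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = e + e2 + b + b2;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threads = 256;
    const int blocks  = prop.multiProcessorCount * 8;   // enough warps to hide latency
    float* dOut;
    cudaMalloc(&dOut, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    fmaMix<<<blocks, threads>>>(dOut, 1.0f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double flops   = 8.0 * ITERS * (double)blocks * threads;  // 4 FMAs = 8 FLOP per iteration
    double clockHz = prop.clockRate * 1e3;                    // clockRate is the shader clock in kHz
    printf("%.1f GFLOP/s, %.2f FLOP/cycle/MP\n",
           flops / (ms * 1e-3) / 1e9,
           flops / (ms * 1e-3) / clockHz / prop.multiProcessorCount);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(dOut);
    return 0;
}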

pcchen's result suggests that it's possible to optimise for the GTX 460.

What is the maximum number of concurrent kernels? I assume it must be different to the GF 100.

I wish Nvidia would release some kind of GF104 Optimisation guide. I assume this will be in the next Fermi Optimisation guide.

Nice catch…
So it seems GF104 suffers from the same register file bandwidth problems that were preventing MAD+MUL dual-issue on G80…

We can do some back-of-the-envelope calculations to explain this:
An SM has 128KB worth of registers. Most likely, it is split into 32 banks of 128x256-bit (the sweet spot with current CMOS technology). (For a RF that huge, don’t even think about multiported memory…)
So assuming perfect load balancing and no bank conflict, the highest total bandwidth we can get is 32x256-bit, that is, 8 warp-sized vectors of floats.

So the SM cannot issue 3 FMAs per cycle if they all require distinct operands (9 warp-sized reads in total).
pcchen's code needs only 6 operands for 3 FMAs (each FMA either reuses a register or takes a compile-time constant, so only 2 distinct register reads per FMA), so it can reach peak throughput.

In theory there should be some patterns that reach peak RF bandwidth (like FMA3, FMA3, FMA2).
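Putting numbers on that, writing FMAn for an FMA that needs n distinct register operands per warp:

\[
32~\text{banks} \times 128 \times 256~\text{bit} = 128~\text{KB},
\qquad
32 \times 256~\text{bit/cycle} = 8~\text{warp-sized operands}
\]
\[
3 \times \text{FMA3} = 9 > 8, \qquad
3 \times \text{FMA2} = 6 \le 8~(\text{pcchen's mix}), \qquad
\text{FMA3} + \text{FMA3} + \text{FMA2} = 8 \le 8
\]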