Attention Lucky GTX 480/GTX 470 Owners! Please run some benchmarks for us. :)

I got my GTX470 this afternoon… I just camped out on all the retail sites and set a kitchen timer to check them every 20 minutes. I finally found a GTX470 on NewEgg. Not the multiple GTX480s I really want, but since those may be impossible for a while, I grabbed the 470. I also sprung for the overnight shipping.

Installed the board this afternoon and booted up Linux. I ran some tests for Stan (which did run) but things were acting “funky”.
In fact, my own Monte Carlo code wasn't running properly on it (roughly 10X slower than I expected, and it didn't terminate).
And then I noticed my OTHER GPUs were also misbehaving and running slowly, and some kernel launches on them were failing. In Linux, nvidia-settings claimed the GTX470 had a 0 MHz shader clock and a 400 MHz memory clock even while it was running CUDA code.

I booted into Windows 7 on the same machine; it reinstalled drivers and rebooted, but when it came back the card was still recognized only as “VGA Device”. GPU-Z could see the board as DirectX 11 with 448 shaders, but not as CUDA or PhysX capable. It also didn’t know the name of the card, and it reported the slot as PCIe 1.0 x16, not PCIe 2.0 x16.

Drivers of course were the biggest suspect, but I have the latest 195.x drivers in Linux and 197.x drivers in Windows.

After removing the GTX470, the other GPUs resumed working fine. I installed it again in a different PCIe slot… still bad.
It’s also not a PSU problem: I have a 1200 watt PSU, and I had been running 2x GTX295 + GT240 previously.

So, after a couple of hours of trying other variations, it looks like I have a bum GTX470 board. I’ll try it on another PC later this week.
Sigh… I was so eager to be the first to post CUDA benches but I guess not.

BTW this is a Gigabyte GTX470. I doubt quality is brand-centric though.

As I understand it, all of these reference boards are being made by one AIB manufacturer, so it shouldn’t be anything specific to Gigabyte (or at least I hope not, because I have a couple of the same card on order…)

Yep. Some differences MIGHT be in their in-house testing.

And eVGA does some extra binning to find the best behaved chips for their “superclocked” cards. (Which means you shouldn’t buy the eVGA standard-clock boards, since you KNOW those have literally been picked over and are therefore inferior to the average board.) The biggest brand differentiator is warranty and support.

I also think the AIB makers might have their own plastic shrouds. But I’m not sure of that. Again, shouldn’t matter.

But for now, brand doesn’t matter. I’m back to searching for GTX480s… probably will end up paying too much on eBay, but hey, it’s all a business expense anyway.


BTW, the NV forums still seem to be down for IE, Firefox, and Chrome. We’d see a lot more posts otherwise. I had to break out my old Mac to get to the forum. (See the thread in the NVIDIA Forums section of the forum.)

Works on Chrome for Windows here, but not on Firefox. After cleaning out all my cookies, it began working on Firefox as well.

I have the same issue over here; I’ll try removing cookies.

What exact PCI device ID does the board have? (Windows Device Manager would tell you.)

If the device ID matches up with what is available in the nvdisp.inf file in the NVIDIA drivers, then why would it still show up as “VGA Device”? That is weird!!!

Did you try reinstalling the drivers (with a complete removal first), and maybe DirectX, after inserting the new card? I think this should be a software problem. CUDA has nothing to do with the mainboard, etc.

Could the authors comment on the test results? It looks like it is only 1.5 times faster in double precision, and two times faster in the two single-precision tests.

Sorry to hear about your bad board. My GTX470 (due in tomorrow) is the factory overclocked EVGA (pure chance - I was just ordering what was available). We’ll see if I have better luck.

I have already taken some requests for benchmarks, and I am willing to run others if they post a link in this thread. I should have the numbers up here sometime tomorrow afternoon.

So that’s what that was. I tried Firefox and IE, to no avail. I guess the problem is fixed now.

NV forums were down for Safari yesterday (4/13) but seem to be OK today. I was worried as I just posted some complaints about DP crippling on GTX 400 cards, and was worried I was being targeted. Just my usual paranoia, I suppose.

Regards,

Martin

Thanks! Let’s see if we can interpret these:

Interesting, the efficiency metric has gone down slightly relative to the GTX 285 and the Tesla C2050. It’s possible that the stream processors are outrunning the memory bandwidth in this kernel now:

GTX 285 peak bandwidth/BogoGFLOPS = 0.224

GTX 480 peak bandwidth/BogoGFLOPS = 0.132

I guess this isn’t surprising, since the GTX 480 tips the balance toward more calculations per read (perhaps hoping that the L2 cache will cover for some of the relative loss in memory bandwidth). The ratio of double to single precision efficiency in this test is ~1/5, much like the GTX 285.
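For anyone who wants to reproduce those two ratios, here is a quick back-of-the-envelope check. It assumes BogoGFLOPS simply means shader clock × CUDA cores × 2 FLOPs per MAD, and it plugs in the published peak bandwidths and shader clocks (159.0 GB/s and 1476 MHz × 240 cores for the GTX 285; 177.4 GB/s and 1401 MHz × 480 cores for the GTX 480). If the benchmark defines BogoGFLOPS differently, the constants would need adjusting:

```
// Rough sanity check of the bandwidth/BogoGFLOPS ratios quoted above.
// Assumption: BogoGFLOPS = shader clock (GHz) * CUDA cores * 2 FLOPs per MAD.
#include <cstdio>

static double bogo_gflops(double shader_clock_ghz, int cores)
{
    return shader_clock_ghz * cores * 2.0;   // one multiply-add per core per clock
}

int main()
{
    // Published specs (assumed): GTX 285 = 1.476 GHz x 240 cores, 159.0 GB/s
    //                            GTX 480 = 1.401 GHz x 480 cores, 177.4 GB/s
    double gtx285 = 159.0 / bogo_gflops(1.476, 240);  // ~0.224
    double gtx480 = 177.4 / bogo_gflops(1.401, 480);  // ~0.132
    printf("GTX 285 peak bandwidth/BogoGFLOPS = %.3f\n", gtx285);
    printf("GTX 480 peak bandwidth/BogoGFLOPS = %.3f\n", gtx480);
    return 0;
}
```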

And, we see that the fantastic global atomic performance is present in the GTX 480.

If we scale down by BogoGFLOPS, then the GTX 480 gives you an efficiency metric of 2005.8 (units irrelevant), and the GTX 285 gives you 1847.8. So the 480 seems to be slightly more efficient in this calculation as well. It would probably be useful to write a test which discovers how much the L2 cache helps when broadcasting the same data to every block.
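A rough sketch of what such a broadcast test might look like is below: every block reads the same buffer, sized to be larger than L1 but well under Fermi’s 768 kB L2, so after the first pass most reads should be served by L2. All names, sizes, and launch parameters here are placeholders I made up, not part of any existing test suite:

```
// Sketch of a broadcast microbenchmark: every block sums the SAME buffer,
// so repeated reads should hit Fermi's L2 rather than DRAM.
// All sizes and names are illustrative only.
#include <cstdio>

__global__ void broadcast_sum(const float *data, int n, float *block_out)
{
    float acc = 0.0f;
    // Every block strides over the entire shared buffer.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += data[i];

    // Crude per-block reduction (assumes blockDim.x == 256, a power of two).
    __shared__ float partial[256];
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_out[blockIdx.x] = partial[0];
}

int main()
{
    const int n = 64 * 1024;            // 256 kB of floats: bigger than L1, smaller than L2
    const int blocks = 120, threads = 256;

    float *data, *block_out;
    cudaMalloc(&data, n * sizeof(float));
    cudaMalloc(&block_out, blocks * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // (A real test would warm up and average several launches.)
    cudaEventRecord(start);
    broadcast_sum<<<blocks, threads>>>(data, n, block_out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("broadcast_sum: %f ms (%.1f GB/s effective)\n",
           ms, blocks * n * sizeof(float) / (ms * 1e6));
    return 0;
}
```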

That seems like a strange default, since the shared memory size was 16 kB before.

Or is it that you think people will, in general, benefit more from more shared memory?

I know what you mean because it was a 403 Forbidden/Access Denied error, not a 404 File Not Found.

I think this allows twice the number of blocks to run, since the register file was doubled too. It is the obvious default setting. Programs that use only a small amount of shared memory per block could benefit more from the 48 kB L1 / 16 kB shared configuration, though.

This is exactly correct.

In fact, Fermi really encourages the use of smaller blocks, since it can schedule not just more blocks per SM (due to the tripled shared memory and doubled register count), but those blocks don’t even need to be from the same kernel. (They do need to be from the same context.) This could help efficiency quite a bit in some situations, essentially removing much of the scheduling loss due to idle SMs, since it is now easier for an SM with free resources to select among multiple work jobs to replenish itself.
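To illustrate both points, here is a minimal sketch (with made-up kernel names) that requests the 48 kB L1 / 16 kB shared split via cudaFuncSetCacheConfig and launches two independent kernels from the same context into different streams, so Fermi is free to co-schedule their blocks when resources allow:

```
// Sketch only: per-kernel cache preference plus two independent kernels in
// separate streams of the same context. Kernel bodies are placeholders.

__global__ void kernel_a(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;
}

__global__ void kernel_b(float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = y[i] * 0.5f - 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // Neither kernel uses shared memory, so prefer the 48 kB L1 / 16 kB shared
    // split for both (both kernels request the same split here).
    cudaFuncSetCacheConfig(kernel_a, cudaFuncCachePreferL1);
    cudaFuncSetCacheConfig(kernel_b, cudaFuncCachePreferL1);

    // Two streams in the same context: on Fermi, blocks from both kernels are
    // eligible to run concurrently when SM resources allow.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    kernel_a<<<(n + 255) / 256, 256, 0, s0>>>(x, n);
    kernel_b<<<(n + 255) / 256, 256, 0, s1>>>(y, n);

    cudaThreadSynchronize();   // CUDA 3.0-era name; cudaDeviceSynchronize in later toolkits
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```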

Thanks for running this, indy2718!

Interpretation for everyone’s benefit:

x@desktop:/home/x/fermi/fermi_test$ time ./gpu_binning
Running gpu_binning microbenchmark: 64000 3.800000 0.200000
....
GPU/simple          : 0.118869 ms

Very interesting: this definitely shows the speed increases in global atomics. This code takes 64000 particles in a box and bucket sorts them into bins on a grid. The GPU/simple kernel is the simplest possible way one might think of doing this: Run 1 thread per particle, determine the bin index, atomicInc the bin counter at that index and place the particle id in the bin. This kernel is extremely light in terms of computation in each thread, but exercises the atomic operations quite a bit. Particles are sorted so that particles near each other in space are near each other in index.
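For anyone who hasn’t seen it, the simple kernel is roughly the following. This is my own reconstruction from the description above, not the benchmark’s actual source; the position layout, bin-index math, and structure names are placeholders, and the host-side setup is omitted:

```
// Sketch of the "GPU/simple" approach described above: one thread per
// particle, compute the bin index, atomicInc the bin counter, and store the
// particle id in the slot the counter returned.

struct Bins {
    unsigned int *count;    // [num_bins] particles currently in each bin
    unsigned int *ids;      // [num_bins * max_per_bin] particle ids, bin-major
    unsigned int  max_per_bin;
};

__global__ void bin_particles_simple(const float4 *pos,   // particle positions
                                     int num_particles,
                                     float cell_width,    // bin edge length
                                     int3 grid_dim,       // bins per axis
                                     Bins bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_particles)
        return;

    float4 p = pos[i];

    // Which bin does this particle fall into?
    int bx = (int)(p.x / cell_width);
    int by = (int)(p.y / cell_width);
    int bz = (int)(p.z / cell_width);
    int bin = (bz * grid_dim.y + by) * grid_dim.x + bx;

    // Reserve a slot in the bin and write our id there.
    unsigned int slot = atomicInc(&bins.count[bin], 0xffffffff);
    if (slot < bins.max_per_bin)
        bins.ids[bin * bins.max_per_bin + slot] = i;
}
```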

The fact that this super simple kernel comes in at 0.11 ms on a Fermi card is sweet. That is 11.3 times faster than the same kernel run on a Tesla S1070! And it is now a 10-20x improvement over the host (depending on the host). This is on an operation that is extremely GPU-unfriendly.

GPU/simple/sort/ 32 : 0.306458 ms
GPU/simple/sort/ 64 : 0.219647 ms
GPU/simple/sort/128 : 0.243106 ms
GPU/simple/sort/256 : 0.283932 ms
GPU/simple/sort/512 : 0.333723 ms

These methods get cute by doing a local per-block sort and scan, so that only one atomicAdd is issued for each set of particles that go to the same bin from that block. They are all slower on Fermi than the simple method, which is very interesting. On the Tesla S1070, these methods were 2-3x faster than the simple method due to the reduced number of atomicAdds needed.
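For reference, the aggregation idea looks roughly like the sketch below. This is a simplified, counting-only reconstruction, not the benchmark’s code: it relies on the particles already being sorted by bin instead of doing the local sort, and it omits the scan that actually places the particle ids:

```
// Simplified sketch of the per-block aggregation idea: since particles are
// sorted, consecutive threads in a block often map to the same bin, so only
// the first thread of each run of identical bin indices issues a global
// atomicAdd (with the run length). One atomic per (block, bin) instead of
// one per particle. Counting only; id placement is omitted.

__global__ void count_particles_per_block(const int *particle_bin,  // precomputed bin index per particle
                                          int num_particles,
                                          unsigned int *bin_count)  // [num_bins]
{
    // Launch with <<<blocks, threads, threads * sizeof(int)>>> so keys[]
    // has one slot per thread.
    extern __shared__ int keys[];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int bin = (i < num_particles) ? particle_bin[i] : -1;
    keys[threadIdx.x] = bin;
    __syncthreads();

    if (bin < 0)
        return;

    // Am I the first thread in this block with this bin index?
    bool run_start = (threadIdx.x == 0) || (keys[threadIdx.x - 1] != bin);
    if (run_start) {
        // Count how long the run of identical bin indices is.
        unsigned int run_len = 1;
        for (int j = threadIdx.x + 1; j < blockDim.x && keys[j] == bin; ++j)
            ++run_len;
        // One global atomic for the whole run.
        atomicAdd(&bin_count[bin], run_len);
    }
}
```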

Above there are 60-100 particles sorted into each bin. The next run uses a smaller bin size so that only 1-3 particles are in each bin.

x@desktop:/home/x/fermi/fermi_test$ time ./gpu_binning 64000 1.12 0.2
Running gpu_binning microbenchmark: 64000 1.120000 0.200000
...
GPU/simple          : 0.090810 ms

Performance is a little faster than above (less contention from many threads atomicIncing the same location), but still fairly flat. Similar behavior to the Tesla S1070.

Hopefully this is the first example of many showing that, with Fermi’s cache, the simplest possible way of writing a kernel is the fastest!

Could anyone post the output from deviceQuery on a GTX 470/480? Here is ours; the number of cores looks suspicious.

Device 0: "GeForce GTX 480"
  CUDA Driver Version:                           3.0
  CUDA Runtime Version:                          3.0
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 1609760768 bytes
  Number of multiprocessors:                     15
  Number of cores:                               120
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.40 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

There is no field in the device properties structure that tells you directly how many cores there are. deviceQuery is incorrectly assuming that there are 8 cores per multiprocessor, which is not true on GF100. To figure out how many cores there are in my test code, I have to check the compute capability number and then multiply the number of multiprocessors by 8 for capability < 2.0 and by 32 for capability >= 2.0.
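In code, that computation is just something like this (with the caveat that the 8-vs-32 cores per SM rule only covers the compute 1.x and 2.0 parts that exist today):

```
// Work out the core count the way described above: SM count times a
// cores-per-SM value inferred from the compute capability (8 for 1.x, 32 for 2.0).
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int cores_per_sm = (prop.major < 2) ? 8 : 32;
    int cores = prop.multiProcessorCount * cores_per_sm;

    printf("%s: %d SMs x %d cores = %d cores (compute %d.%d)\n",
           prop.name, prop.multiProcessorCount, cores_per_sm, cores,
           prop.major, prop.minor);
    return 0;
}
```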

SM count is the only thing that matters.

Could somebody also compare the SDK examples? At least everybody has those. The N-body example would be interesting, to confirm these numbers.