Attention Lucky GTX 480/GTX 470 Owners! Please run some benchmarks for us. :)

Since I sadly don’t have the spare funds (at work or home) to pick up a GTX 480 or GTX 470 immediately, I wanted to ask if those of you who do receive your cards in the next few days could run some tests for me:

http://bitbucket.org/seibert/fermi_test/src/

There are two self-contained .cu files in that directory. (If you don’t have Mercurial installed, you can easily download them individually over the web.) I would be very grateful if you could download each, compile them with -arch sm_20, and run them on your new hotness.

tmurray was nice enough to run these for me on an as-yet-unreleased Tesla card, but I need to check with him about how many of those numbers I can post without getting him in trouble. :)

pdfs.cu runs some “tests inspired by real CUDA code” which do something like kernel estimation (not “kernel” as in CUDA kernel, though) and do histogramming purely through abuse of atomics. Based on tmurray’s run of this code, I agree with him that atomic performance is insanely great on Fermi now.

rayleigh.cu does a Rayleigh power calculation (also a test inspired by real CUDA code I’ve used), where I want to see how much the L2 cache helps broadcast data to all blocks more efficiently.

Others should feel free to jump in with .cu files they want to see run on Fermi, and hopefully kind owners will oblige. :)

Do you mean shared or global memory atomics?

To start things off with a baseline, this is the output from those two programs on a GTX 285:

pdfs.cu:

Device name: GeForce GTX 285

BogoGFLOPS: 708.5

Single precision: time = 279.620 ms, efficiency metric = 11.91

Double precision: time = 1459.496 ms, efficiency metric = 2.28

Atomic abuse: time = 2.962 ms, events/sec = 51855.5, events/sec/bogoGFLOP = 73.19

Comments: I’ve tried to normalize out the effect of clock rate * stream processors on these benchmarks in order to see how much more efficient the Fermi architecture is than the GT200. BogoGFLOPS is defined to be 2 * [clock rate] * [multiprocessors] * [8 if GT200, 32 if Fermi]. In the first two tests, the efficiency metric is pretty much [workload] / [BogoGFLOPS], so you should get roughly the same value on all chips if it is purely compute bound. Notice that the double precision version is only 1/5 the speed of the single precision and not 1/8, because other things are going on in the kernel besides double precision instructions.
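For anyone who wants to reproduce the normalization, the BogoGFLOPS definition above can be sketched roughly as below. This is not the code from pdfs.cu. The function names are made up for illustration, keying the 8/32 factor on the compute capability major version (1 for GT200, 2 for Fermi) is an assumption, and the device query uses cudaDeviceGetAttribute, which assumes a newer toolkit than the one this thread was written against.

```cuda
#include <cuda_runtime.h>

// Cores per multiprocessor as given in the post: 8 on GT200, 32 on Fermi.
// Keying this on the compute capability major version is an assumption;
// architectures the post does not cover return 0.
int cores_per_multiprocessor(int compute_major) {
    if (compute_major == 1) return 8;   // GT200
    if (compute_major == 2) return 32;  // Fermi
    return 0;
}

// BogoGFLOPS = 2 * [clock rate in GHz] * [multiprocessors] * [cores per multiprocessor]
double bogo_gflops(double clock_ghz, int multiprocessors, int compute_major) {
    return 2.0 * clock_ghz * multiprocessors * cores_per_multiprocessor(compute_major);
}

// Same number for an installed device. Returns a negative value if a query fails.
double bogo_gflops_for_device(int device) {
    int clock_khz = 0, multiprocessors = 0, major = 0;
    if (cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&multiprocessors, cudaDevAttrMultiProcessorCount, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device) != cudaSuccess) {
        return -1.0;
    }
    return bogo_gflops(clock_khz * 1e-6, multiprocessors, major);  // kHz -> GHz
}
```

With the GTX 285’s 30 multiprocessors and 1.476 GHz shader clock, bogo_gflops(1.476, 30, 1) works out to 708.48, which agrees with the 708.5 the program prints.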

Atomic abuse shows that a simplistic histogramming algorithm can bin 52 kevents/sec on GT200, and this should be way, way better on GF100.
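To make “simplistic histogramming through atomics” concrete, here is a minimal sketch of the pattern, assuming one thread per event and one atomicAdd per event on a global-memory counter. It is not the kernel from pdfs.cu, which may differ in every detail. The names histogram_global and global_histogram_rate are hypothetical, and the rate only counts kernel time measured with CUDA events.

```cuda
#include <cuda_runtime.h>

// One thread per event; every increment is an atomicAdd on a global-memory counter.
__global__ void histogram_global(const unsigned int *bin_index, int n,
                                 unsigned int *bins, unsigned int num_bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && bin_index[i] < num_bins) {
        atomicAdd(&bins[bin_index[i]], 1u);
    }
}

// Bins n events and returns events per second of kernel time (copies excluded),
// or a negative value on any failure. host_bins receives the final counts.
double global_histogram_rate(const unsigned int *host_index, int n,
                             unsigned int *host_bins, unsigned int num_bins) {
    if (n <= 0 || num_bins == 0) return -1.0;
    unsigned int *d_index = nullptr, *d_bins = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    float ms = 0.0f;
    bool ok =
        cudaMalloc(&d_index, n * sizeof(unsigned int)) == cudaSuccess &&
        cudaMalloc(&d_bins, num_bins * sizeof(unsigned int)) == cudaSuccess &&
        cudaMemcpy(d_index, host_index, n * sizeof(unsigned int),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(d_bins, 0, num_bins * sizeof(unsigned int)) == cudaSuccess &&
        cudaEventCreate(&start) == cudaSuccess &&
        cudaEventCreate(&stop) == cudaSuccess;
    if (ok) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        cudaEventRecord(start);
        histogram_global<<<blocks, threads>>>(d_index, n, d_bins, num_bins);
        cudaEventRecord(stop);
        ok = cudaEventSynchronize(stop) == cudaSuccess &&
             cudaGetLastError() == cudaSuccess &&
             cudaEventElapsedTime(&ms, start, stop) == cudaSuccess &&
             cudaMemcpy(host_bins, d_bins, num_bins * sizeof(unsigned int),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(d_index);
    cudaFree(d_bins);
    return (ok && ms > 0.0f) ? n / (ms * 1e-3) : -1.0;
}
```

Dividing the returned rate by the device’s BogoGFLOPS gives a number in the spirit of the events/sec/bogoGFLOP column, though it will not be comparable to the pdfs.cu figures because the workload is different.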

rayleigh.cu:

Device name: GeForce GTX 285

BogoGFLOPS: 708.5

Rayleigh power: time = 1955.373 ms, event*freq/sec = 1309213.4

Here I didn’t bother to normalize by BogoGFLOPS. This kernel uses a mixture of single and double precision.

Global atomics. I expect the “atomic abuse” code could be made much faster by doing histogramming with shared atomics within each block and then incrementing the global bin counters. (Assuming the histogram fits in shared memory.) Nevertheless, I wanted to see how much better global atomics are now.
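The shared-then-global idea could look something like the sketch below. It assumes the histogram fits in shared memory, as stated, and that shared-memory atomics are available (compute capability 1.2 or later). NUM_BINS, the launch configuration, and both function names are hypothetical choices for illustration, not anything taken from pdfs.cu.

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256  // hypothetical bin count; the whole histogram must fit in shared memory

// Each block accumulates into a private shared-memory histogram, then adds its
// non-empty bins to the global counters with one global atomic per bin.
__global__ void histogram_shared(const unsigned int *bin_index, int n,
                                 unsigned int *global_bins) {
    __shared__ unsigned int local_bins[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) local_bins[b] = 0;
    __syncthreads();

    // Grid-stride loop so any grid size covers all n events.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        unsigned int b = bin_index[i];
        if (b < NUM_BINS) atomicAdd(&local_bins[b], 1u);  // shared-memory atomic
    }
    __syncthreads();

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        if (local_bins[b] != 0) atomicAdd(&global_bins[b], local_bins[b]);  // global atomic
    }
}

// Host wrapper: copies the bin indices over, runs the kernel, copies the counts back.
cudaError_t histogram_on_device(const unsigned int *host_index, int n,
                                unsigned int *host_bins) {
    if (n <= 0) return cudaErrorInvalidValue;
    unsigned int *d_index = nullptr, *d_bins = nullptr;
    cudaError_t err = cudaMalloc(&d_index, n * sizeof(unsigned int));
    if (err == cudaSuccess) err = cudaMalloc(&d_bins, NUM_BINS * sizeof(unsigned int));
    if (err == cudaSuccess)
        err = cudaMemcpy(d_index, host_index, n * sizeof(unsigned int),
                         cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemset(d_bins, 0, NUM_BINS * sizeof(unsigned int));
    if (err == cudaSuccess) {
        histogram_shared<<<64, 256>>>(d_index, n, d_bins);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host_bins, d_bins, NUM_BINS * sizeof(unsigned int),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_index);
    cudaFree(d_bins);
    return err;
}
```

The number of global atomics drops from one per event to at most NUM_BINS per block, which is where the expected speedup would come from. Whether it actually wins depends on how fast shared atomics are on the card in question.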

Thanks. I think a shared atomic test would be cool too. Just yesterday I got rid of shared atomics in my program because they were so slow. Now I think I’ll keep that code path in case they have been improved.

I haven’t heard anything about shared memory atomics being improved. The big benefit Fermi gives to global atomics is that they are done in the L2 cache and not device memory.

On the topic of improved atomics on Fermi, I’ve got a single cu file benchmark I’d like to see run on one. Get the cu file here: https://codeblue.umich.edu/hoomd-blue/trac/…/gpu_binning.cu (there is an “original format” link at the bottom).

Compile with

$ nvcc -arch=sm_20 -DCUDA_ARCH=20 -o gpu_binning gpu_binning.cu

Then run it twice with the following settings:

./gpu_binning

...

./gpu_binning 64000 1.12 0.2

For reference, here is the performance on a Tesla S1070

$ ./gpu_binning 

Running gpu_binning microbenchmark: 64000 3.800000 0.200000

sorting....

	done.

Host				: 1.084418 ms

Host w/device memcpy: 1.573579 ms

GPU/simple		  : 1.256065 ms

GPU/simple/sort/ 32 : 0.388527 ms

GPU/simple/sort/ 64 : 0.358381 ms

GPU/simple/sort/128 : 0.380386 ms

GPU/simple/sort/256 : 0.476812 ms

GPU/simple/sort/512 : 0.603767 ms

GPU/update		  : 2.561406 ms

[joaander@ac ~]$ ./gpu_binning 64000 1.12 0.2

Running gpu_binning microbenchmark: 64000 1.120000 0.200000

sorting....

	done.

Host				: 1.750725 ms

Host w/device memcpy: 12.637670 ms

GPU/simple		  : 0.314473 ms

GPU/simple/sort/ 32 : 0.408186 ms

GPU/simple/sort/ 64 : 0.399517 ms

GPU/simple/sort/128 : 0.430154 ms

GPU/simple/sort/256 : 0.528544 ms

GPU/simple/sort/512 : 0.665140 ms

GPU/update		  : 2.719277 ms

Why "-arch=sm_13 -DCUDA_ARCH=13" and not 20?

Because I copied and pasted the command I was testing with on the Tesla S1070 box. :) I edited the post to fix the typo.

I have a very special piece of code that runs in 120,000 cycles on GT200 using extreme atomic abuse. I was able to do some trickery (take six weeks, remove all atomics, insert horrible black magic, ensure that nobody but me will ever understand the code) and knock that down to ~18,000 cycles.

I put that exact same piece of atomic abusing code on GF100 (a C2050, but this code doesn’t do any meaningful arithmetic), and it’s running at 5700 cycles. Whee!

And for tmurray’s GF100-based Tesla, here are sanitized results that we can compare to the GTX 470/480:

pdfs.cu:

Single precision: <redacted> efficiency metric = 11.75

Double precision: <redacted> efficiency metric = 4.24

Atomic abuse: <redacted> events/sec/bogoGFLOP = 1179.73

Three important things to note here:

  • The single precision efficiency for this kernel is basically the same for both devices, suggesting that this is a compute bound kernel not getting any benefit from the L2 cache.

  • The double precision efficiency is 1/2.77 of the single precision, so we are not achieving the peak DP flops even with this relatively simple kernel. It’s possible that doubling the size of the inputs has made this a partially bandwidth-bound kernel.

  • Holy cow, global atomics are super fast now.
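For reference, the ratios behind these three points, worked from the sanitized GF100 numbers and the GTX 285 baseline posted earlier (plain arithmetic on the posted figures, nothing new measured):

```latex
\begin{align*}
\text{GF100 Tesla, SP/DP efficiency:} \quad & \frac{11.75}{4.24} \approx 2.77 \\
\text{GTX 285, SP/DP efficiency:} \quad & \frac{11.91}{2.28} \approx 5.22 \\
\text{Atomic events/sec/bogoGFLOP, GF100 vs. GTX 285:} \quad & \frac{1179.73}{73.19} \approx 16.1
\end{align*}
```

So even after normalizing out clock rate and core count, the global atomics test comes out roughly 16 times better on GF100.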

“The single precision efficiency for this kernel is basically the same for both devices, suggesting that this is a compute bound kernel not getting any benefit from the L2 cache.”

So, Fermi should be two times faster in single precision?

For the GTX 480, yeah, pretty close for compute bound stuff. Twice the number of stream processors but a slightly lower clock rate than the GTX 285.

Basically, you could think of the GTX 480 as a slightly faster GTX 295, without the hassle of multi-GPU programming. :) (and slightly less aggregate memory bandwidth)

For the record, I asked seibert to redact the exact values because I’m not 100% sure that my GF100s are indicative of final performance (in terms of clocks).

I think Fermi could show its power on complex CUDA applications. Personally, I am hoping for the increased register file, though I am a bit wary of the decrease in the number of stream processors. Could it lead to worse performance in code with a high level of divergence? Btw, how is the cache configured by default? There is a lot to tune.

Cache is 48k shared/16k L1 by default.

Btw, in principle, the compiler could analyze the shared memory and register use of all kernels and set the best mode by itself for old CUDA programs. For example, if the kernels do not require much shared memory, the compiler could switch to 16/48 (16k shared/48k L1). Kernels that do not use much shared memory are usually memory bound and could benefit from a large L1 cache. We could see some unpleasant Fermi benchmarks of old CUDA programs because they are not tuned.

That’s not really true at all. Compiler can’t figure out optimal occupancy, whether you’ll get any benefit from L1, or all sorts of things like that.

Many CUDA programs use a constant block size that is known at compile time. Shared memory use is also often proportional to block size. With a small amount of shared memory per block, most of the 48 KB of shared memory goes unused, and that is known at compile time.

And the compiler that generates CUDA modules is separate from the host compiler, and the occupancy issues are a major problem, and changing cache size prevents concurrent kernels which may or may not matter a lot to your application, etc. So, to summarize: this is absolutely not something that can be generated in a meaningful way at compile time. That’s why there’s one default that you may be able to tune for better performance.
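For readers wondering what that tuning looks like in practice, here is a minimal sketch using cudaFuncSetCacheConfig from the runtime API to request the larger L1 split for one kernel. The kernel scale and the wrapper scale_prefer_l1 are hypothetical examples, chosen only because the kernel uses no shared memory. The call expresses a preference that the runtime is free to ignore, and whether it helps has to be measured per application.

```cuda
#include <cuda_runtime.h>

// Hypothetical streaming kernel that uses no shared memory at all.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Requests the larger L1 split for this one kernel, then runs it in place on host data.
cudaError_t scale_prefer_l1(float *host_data, int n, float factor) {
    if (n <= 0) return cudaErrorInvalidValue;

    // A preference, not a guarantee: the runtime may pick a different split.
    cudaError_t err = cudaFuncSetCacheConfig(scale, cudaFuncCachePreferL1);
    if (err != cudaSuccess) return err;

    float *d_data = nullptr;
    err = cudaMalloc(&d_data, n * sizeof(float));
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threads = 256;
        scale<<<(n + threads - 1) / threads, threads>>>(d_data, n, factor);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return err;
}
```

Passing cudaFuncCachePreferShared instead asks for the larger shared memory split, and cudaFuncCachePreferNone leaves the choice to the runtime.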

System:

QX9650 3.0 GHz quad

ASUS P5E

8 GB memory

Single EVGA GTX 480, running X without compositing.

[codebox]

x@desktop:/home/x/fermi/fermi_test$ nvcc -arch sm_20 pdfs.cu

x@desktop:/home/x/fermi/fermi_test$ time ./a.out

Device name: GeForce GTX 480

BogoGFLOPS: 1345.0

Single precision: time = 170.379 ms, efficiency metric = 10.30

Double precision: time = 882.484 ms, efficiency metric = 1.99

Atomic abuse: time = 0.105 ms, events/sec = 1469687.8, events/sec/bogoGFLOP = 1092.74

real 0m2.204s

user 0m2.108s

sys 0m0.092s

x@desktop:/home/x/fermi/fermi_test$ time ./a.out

Device name: GeForce GTX 480

BogoGFLOPS: 1345.0

Single precision: time = 170.401 ms, efficiency metric = 10.29

Double precision: time = 882.486 ms, efficiency metric = 1.99

Atomic abuse: time = 0.105 ms, events/sec = 1467441.1, events/sec/bogoGFLOP = 1091.07

real 0m2.211s

user 0m2.112s

sys 0m0.096s

x@desktop:/home/x/fermi/fermi_test$ time ./a.out

Device name: GeForce GTX 480

BogoGFLOPS: 1345.0

Single precision: time = 170.388 ms, efficiency metric = 10.30

Double precision: time = 882.493 ms, efficiency metric = 1.99

Atomic abuse: time = 0.105 ms, events/sec = 1467441.1, events/sec/bogoGFLOP = 1091.07

real 0m2.207s

user 0m2.116s

sys 0m0.088s

x@desktop:/home/x/fermi/fermi_test$ nvcc -arch sm_20 rayleigh.cu

x@desktop:/home/x/fermi/fermi_test$ time ./a.out

Device name: GeForce GTX 480

BogoGFLOPS: 1345.0

Rayleigh power: time = 948.915 ms, event*freq/sec = 2697817.5

real 0m1.991s

user 0m1.896s

sys 0m0.092s

x@desktop:/home/x/fermi/fermi_test$ time ./a.out

Device name: GeForce GTX 480

BogoGFLOPS: 1345.0

Rayleigh power: time = 948.944 ms, event*freq/sec = 2697736.5

real 0m1.996s

user 0m1.900s

sys 0m0.092s

x@desktop:/home/x/fermi/fermi_test$ time ./a.out

Device name: GeForce GTX 480

BogoGFLOPS: 1345.0

Rayleigh power: time = 948.941 ms, event*freq/sec = 2697743.2

real 0m1.999s

user 0m1.896s

sys 0m0.100s


x@desktop:/home/x/fermi/fermi_test$ nvcc -arch=sm_20 -DCUDA_ARCH=20 -o gpu_binning gpu_binning.cu

./gpu_binning.cu(577): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(573): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(644): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(577): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(573): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(644): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(577): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(573): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(644): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(577): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(573): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(644): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(577): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(573): Advisory: Loop was not unrolled, cannot deduce loop trip count

./gpu_binning.cu(644): Advisory: Loop was not unrolled, cannot deduce loop trip count

x@desktop:/home/x/fermi/fermi_test$ time ./gpu_binning

Running gpu_binning microbenchmark: 64000 3.800000 0.200000

sorting…

done.

Host : 1.805127 ms

Host w/device memcpy: 2.698395 ms

GPU/simple : 0.118869 ms

GPU/simple/sort/ 32 : 0.306458 ms

GPU/simple/sort/ 64 : 0.219647 ms

GPU/simple/sort/128 : 0.243106 ms

GPU/simple/sort/256 : 0.283932 ms

GPU/simple/sort/512 : 0.333723 ms

GPU/update : 1.536448 ms

real 0m58.531s

user 0m58.408s

sys 0m0.108s

x@desktop:/home/x/fermi/fermi_test$ time ./gpu_binning 64000 1.12 0.2

Running gpu_binning microbenchmark: 64000 1.120000 0.200000

sorting…

done.

Host : 2.724192 ms

Host w/device memcpy: 26.939022 ms

GPU/simple : 0.090810 ms

GPU/simple/sort/ 32 : 0.312103 ms

GPU/simple/sort/ 64 : 0.216333 ms

GPU/simple/sort/128 : 0.234767 ms

GPU/simple/sort/256 : 0.279113 ms

GPU/simple/sort/512 : 0.328357 ms

GPU/update : 1.223797 ms

real 1m23.604s

user 1m23.409s

sys 0m0.164s

x@desktop:/home/x/fermi/fermi_test$ time ./gpu_binning

Running gpu_binning microbenchmark: 64000 3.800000 0.200000

sorting…

done.

Host : 1.297337 ms

Host w/device memcpy: 2.168851 ms

GPU/simple : 0.115417 ms

GPU/simple/sort/ 32 : 0.303133 ms

GPU/simple/sort/ 64 : 0.216431 ms

GPU/simple/sort/128 : 0.239582 ms

GPU/simple/sort/256 : 0.280600 ms

GPU/simple/sort/512 : 0.330431 ms

GPU/update : 1.530422 ms

real 0m41.816s

user 0m41.691s

sys 0m0.116s

x@desktop:/home/x/fermi/fermi_test$ time ./gpu_binning 64000 1.12 0.2

Running gpu_binning microbenchmark: 64000 1.120000 0.200000

sorting…

done.

Host : 2.351914 ms

Host w/device memcpy: 27.406826 ms

GPU/simple : 0.090793 ms

GPU/simple/sort/ 32 : 0.312163 ms

GPU/simple/sort/ 64 : 0.216422 ms

GPU/simple/sort/128 : 0.234768 ms

GPU/simple/sort/256 : 0.279076 ms

GPU/simple/sort/512 : 0.326411 ms

GPU/update : 1.222879 ms

real 1m10.587s

user 1m10.372s

sys 0m0.184s

[/codebox]