The Kepler whitepaper says that atomic operations on a shared address are about 11 times faster on Kepler. Is this only true for Kepler Tesla cards, or also for the GTX 680?
I had been meaning to dust off one of my microbenchmarks from the Fermi release:
http://bitbucket.org/seibert/fermi_test/src/c20b947d02f9/pdfs.cu
[Edit: fixed link to correct version of file]
(Compile with: nvcc -o pdfs -arch [put arch of card here] pdfs.cu)
This benchmark does some single- and double-precision 1D kernel density estimation ("kernel" in the mathematical sense, not a CUDA kernel), and also computes a histogram from random data using atomics on global memory addresses. Because the data are random, this tests atomic speed with mostly independent addresses across threads.
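For readers who don't want to dig through the linked file, here is a minimal sketch of what a global-memory atomic histogram looks like. This is not the code from pdfs.cu; the names histogram_atomic and histogram_on_gpu, the block size of 256, and the [lo, hi) binning are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per sample: find the sample's bin and increment that counter
// in global memory. atomicAdd keeps the count correct when two threads
// land in the same bin at the same time.
__global__ void histogram_atomic(const float *samples, int n_samples,
                                 unsigned int *bins, int n_bins,
                                 float lo, float hi)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_samples) return;
    float x = samples[i];
    if (x < lo || x >= hi) return;              // ignore out-of-range samples
    int bin = (int)((x - lo) / (hi - lo) * n_bins);
    if (bin >= n_bins) bin = n_bins - 1;        // guard against float rounding
    atomicAdd(&bins[bin], 1u);
}

// Hypothetical host helper (error checking omitted for brevity).
// Assumes samples is non-empty.
std::vector<unsigned int> histogram_on_gpu(const std::vector<float> &samples,
                                           int n_bins, float lo, float hi)
{
    int n = (int)samples.size();
    float *d_samples = nullptr;
    unsigned int *d_bins = nullptr;
    cudaMalloc(&d_samples, n * sizeof(float));
    cudaMalloc(&d_bins, n_bins * sizeof(unsigned int));
    cudaMemcpy(d_samples, samples.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, n_bins * sizeof(unsigned int));

    int threads = 256;                          // assumed block size
    int blocks = (n + threads - 1) / threads;
    histogram_atomic<<<blocks, threads>>>(d_samples, n, d_bins, n_bins, lo, hi);

    // This copy runs on the default stream, so it waits for the kernel.
    std::vector<unsigned int> bins(n_bins);
    cudaMemcpy(bins.data(), d_bins, n_bins * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_samples);
    cudaFree(d_bins);
    return bins;
}
```

With random input, neighbouring threads usually hit different bins, so this mostly measures atomics on independent addresses; the fewer bins there are, the more often two threads collide.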
For comparison, here are the benchmark results for a GTX 680, GTX 580 and GTX 295 in the same computer:
Device name: GeForce GTX 680
BogoGFLOPS: 3251.7
Single precision: time = 816.824 ms, efficiency metric = 14.21
Double precision: time = 13302.265 ms, efficiency metric = 0.87
Atomic abuse: time = 0.087 ms, events/sec = 7097966.5, events/sec/bogoGFLOP = 2182.84
----
Device name: GeForce GTX 580
BogoGFLOPS: 1581.1
Single precision: time = 1417.309 ms, efficiency metric = 16.85
Double precision: time = 9052.771 ms, efficiency metric = 2.64
Atomic abuse: time = 0.376 ms, events/sec = 1632791.9, events/sec/bogoGFLOP = 1032.72
----
Device name: GeForce GTX 295
BogoGFLOPS: 596.2
Single precision: time = 4865.107 ms, efficiency metric = 13.02
Double precision: time = 26113.904 ms, efficiency metric = 2.42
Atomic abuse: time = 13.974 ms, events/sec = 43968.8, events/sec/bogoGFLOP = 73.75
“BogoGFLOPS” = 2 * clock rate (in GHz) * # of CUDA cores, which is a handy way to normalize things to understand architecture differences.
“Efficiency metric” = proportional to [number of calculations / elapsed time / bogoGFLOPS]
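As a sanity check on the BogoGFLOPS definition, the reported values can be inverted to recover the clock each card was running at. The core counts used here (1536, 512, and 240 per GPU on the dual-GPU GTX 295) come from NVIDIA's published specs rather than from the benchmark output, so treat them as an assumption:

```latex
\begin{align*}
f_{\text{clock}} &= \frac{\text{BogoGFLOPS}}{2 \times N_{\text{cores}}} \\[4pt]
\text{GTX 680:}\quad f_{\text{clock}} &= \frac{3251.7}{2 \times 1536} \approx 1.058\ \text{GHz} \\
\text{GTX 580:}\quad f_{\text{clock}} &= \frac{1581.1}{2 \times 512} \approx 1.544\ \text{GHz} \\
\text{GTX 295:}\quad f_{\text{clock}} &= \frac{596.2}{2 \times 240} \approx 1.242\ \text{GHz}
\end{align*}
```

Those clocks look like the GTX 680's boost clock and the shader clocks of the two older cards, which suggests the benchmark is simply using whatever clock rate the driver reports for one GPU.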
Looking at events/sec in the atomic test, the GTX 680 is 4.3x faster than the GTX 580. That’s even a little better than the 3.5x quoted in the Kepler whitepaper for independent addresses. The likely reason is that the test only has 200 bins, so sometimes two threads increment the same bin at the same time, and Kepler's advantage grows when atomics collide.
To test the performance of atomics with a shared address, I tweaked the atomic test to compute the same bin for every event being histogrammed. Here’s what I get:
Device name: GeForce GTX 580
BogoGFLOPS: 1581.1
Atomic abuse: time = 6.852 ms, events/sec = 89673.1, events/sec/bogoGFLOP = 56.72
----
Device name: GeForce GTX 680
BogoGFLOPS: 3251.7
Atomic abuse: time = 0.591 ms, events/sec = 1039579.8, events/sec/bogoGFLOP = 319.70
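The shared-address variant described above could look something like the following sketch. It is not the actual tweak made to pdfs.cu: the names atomic_same_address and run_same_address_test and the block size are hypothetical, and timing with CUDA events is just one reasonable way to measure it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread increments the same counter, so all atomics contend for one
// global address (the worst case for atomic throughput).
__global__ void atomic_same_address(unsigned int *counter, int n_events)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_events) atomicAdd(counter, 1u);
}

// Hypothetical helper: runs the kernel once, stores the elapsed time in
// *elapsed_ms, and returns the final counter value (should equal n_events).
// Error checking omitted for brevity; assumes n_events > 0.
unsigned int run_same_address_test(int n_events, float *elapsed_ms)
{
    unsigned int *d_counter = nullptr;
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;                          // assumed block size
    int blocks = (n_events + threads - 1) / threads;
    cudaEventRecord(start);
    atomic_same_address<<<blocks, threads>>>(d_counter, n_events);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                 // wait for the kernel to finish
    cudaEventElapsedTime(elapsed_ms, start, stop);

    unsigned int total = 0;
    cudaMemcpy(&total, d_counter, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_counter);
    return total;
}
```

Events/sec is then n_events divided by the elapsed time. For real measurements it is worth running the kernel once as a warm-up first, since the first launch includes one-time setup overhead.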
In terms of events/sec, the GTX 680 is 11.6x faster than the GTX 580. This is almost exactly the 11.7x claimed in the whitepaper.
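Spelling out the arithmetic from the two sets of results, using only the numbers printed above and including the per-bogoGFLOP normalization:

```latex
\begin{align*}
\text{200 bins, raw:}\quad
  & \frac{7097966.5}{1632791.9} \approx 4.3
  & \text{per bogoGFLOP:}\quad
  & \frac{2182.84}{1032.72} \approx 2.1 \\[4pt]
\text{Shared address, raw:}\quad
  & \frac{1039579.8}{89673.1} \approx 11.6
  & \text{per bogoGFLOP:}\quad
  & \frac{319.70}{56.72} \approx 5.6 \\[4pt]
\text{Contention penalty, GTX 580:}\quad
  & \frac{1632791.9}{89673.1} \approx 18.2
  & \text{GTX 680:}\quad
  & \frac{7097966.5}{1039579.8} \approx 6.8
\end{align*}
```

So even after normalizing away the GTX 680's higher raw throughput, it handles fully contended atomics about 5.6x better, and forcing every thread onto one address costs it roughly 6.8x compared with about 18x on the GTX 580.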
So yes, the GeForce GTX 680 does provide the large improvement in atomic performance promised for Kepler.
Thanks seibert!
Very interesting, thank you!