Taking apart global atomics: performance, graphs, theories

Steven Robertson (srobertson on these forums) is stress-testing the global atomics implementation on the G200b chip architecture.

http://strobe.cc/articles/cuda_atomics/

Some great (and puzzling) performance numbers there. Check back frequently, the article is growing day by day ;)

Christian

I am actually wondering whether the alledged performance benefit when using global atomics with 1, 2 and 4 warps per SM can actually be seen and exploited in real number-crunching applications, or whether it’s an artifact of the method that was used for measurements.

Christian

“Alleged”? Ooh, fighting words. :)

Practical tests with a random number generator to simulate compute load will be done soon, but I don’t believe this has much to do with the benchmarking method. I’ve updated the site with a bit more explanation of why this is so, but in short, it appears that the ROPs are being used to handle the atomic operations. I’ve got more tests to run which may confirm this suspicion more directly.

With regard to the fractal flame algorithm, I think that with a few optimizations (which will also be benchmarked at the above link), using atomics to scatter to the framebuffer will actually be both the simplest and fastest approach. Of course, the article’s still being updated, so keep an eye out; I’ll post back when it’s complete.

Well, except for my spelling mistake, the word was carefully chosen. ;) A little bit of scepticism is always in order when seeing results that seem to defy logic. But your theory about the ROPs taking care of the atomics seems reasonable.

If I implemented my scatter using the OpenGL pipeline (blending pixels into a target buffer), it would effectively be using the same hardware and thus should achieve similar peak performance.

Maybe consider the possibility that while these atomic instructions may exhibit lower latencies (in terms of clock cycles), they may be more bandwidth limited than regular global memory accesses from CUDA. This is certainly something that calls for a benchmark.
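A minimal sketch of the comparison I have in mind (kernel names and launch shape are mine, purely for illustration): run both at the same occupancy, time each over many launches with cudaEvent timers, and compare effective bandwidth. If the atomic version falls behind even with zero address conflicts, the limit is bandwidth rather than latency.

__global__ void plain_stores(unsigned int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;              // ordinary coalesced global store
}

__global__ void atomic_updates(unsigned int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    atomicAdd(&out[i], 1u);  // one atomic per thread; no two threads touch the same address
}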

Christian

Steven,

In your experiments, you only tested the “reduction” memory ops, didn’t you?
(when you don’t use the result of an atomic, it seems that ptxas generates a reduction op instead, like the PTX red instruction)

It would be interesting to try with “true” atomics, where there exists a dependency from the value returned by the atomic operation. My guess is that the latency will suddenly get much higher… :)
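To make the distinction concrete, here is a CUDA-level sketch of the two cases (whether ptxas really demotes the first one to red is exactly the recollection I am unsure about):

__global__ void reduction_style(unsigned int *counter)
{
    // Return value ignored: ptxas may emit a fire-and-forget
    // red.global.add here instead of atom.global.add.
    atomicAdd(counter, 1u);
}

__global__ void true_atomic(unsigned int *counter, unsigned int *out)
{
    // Return value consumed: the dependency forces a real atom,
    // and the thread must wait for the old value to come back.
    unsigned int old = atomicAdd(counter, 1u);
    out[blockIdx.x * blockDim.x + threadIdx.x] = old;
}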

Your Python code looks interesting. Could it be used to generate assembly code directly, rather than PTX?

I should be the first to note here that I might be horribly wrong about all of this. I found a… well, a bug, I guess. It doesn’t really invalidate any of the previous numbers, which are still correct; it just means they are completely inapplicable to kernels that last longer than a few microseconds. (More details at the original page.) I’m currently reanalyzing things, and have more or less retracted my results until I can explain the discrepancies, but right now it looks like, at least for kernels that do mostly IO, atomics are slower, just as everyone expected.

(I’m now extremely happy that I wrote “appears” and “suspicion” in my response to cbuchner1.)

Sylvain, a quick test indicates that ‘red’ and ‘atom’ are roughly equivalent in terms of latencies (they’re probably identical; the stdevs were much larger than the differences). I need to take things apart with a disassembler to be sure, but last time I did so (a year ago, maybe) ‘red’ and ‘atom’ actually used the same opcodes when compiled, at least on an 8800. And while Python’s string.Template is one way to generate assembly with certain values or execution paths baked in, I would recommend a real templating language for that, like Mako or Jinja2; my little script is small enough to get away without it, but it would be messy if the kernels were much larger. Of course, LLVM is designed to support that sort of thing, and it works with higher-level languages, too.

Humbly and carefully from now on,
Steven

Don’t worry, this happens to me all the time too. :)

There are two assumptions that I would challenge:

  • That you “get the determinism of assembly languages” with PTX. PTX has an assembly-like syntax, but is actually quite a high-level language. In my experience, a lot happens between PTX and assembly: instructions get reordered, dead code is eliminated, constants and assignments are propagated… (all the usual compiler optimizations).

  • That you can measure the latency of a load with this sequence of instructions:

mov.u32       clka,  %clock;   // read clock before the load
ld.global.u32 clkoa, [base];   // load is issued here (~4 clocks)...
mov.u32       clkb,  %clock;   // ...but may complete long after this clock read

Even if this were written in assembly language, the hardware has some out-of-order processing capability that lets it handle loads asynchronously. That is, executing the ld instruction is almost instantaneous (4 clocks), but the stall occurs only when the resulting register is used as an operand by a subsequent instruction, which can be many instructions later.

In my own benchmarks (latency, throughput), I use something like:

mov.u32       clka,  %clock;       // read clock before the load
ld.global.u32 value, [base];       // issue the load
xor.b32       clka,  clka, value;  // consume 'value': this stalls until the load completes
mov.u32       clkb,  %clock;       // read clock once the data has actually arrived
xor.b32       clka,  clka, value;  // undo the first xor, restoring the original clka

True, but did you use the return value of the atom instruction in your test? I’m not so sure myself, but I seem to remember that ptxas optimizes atom instructions by turning them into red whenever the result is not needed. So it would look like both map to the same opcode, when that opcode is actually a red.

By the way, your latency results seemed rather consistent to me so far, for a red operation.

My own tests indicate roughly the same latency for atomics as for reads.

Thanks for the insight. If the test code is written in Python, it might also be possible to plug it directly into cudasm/decuda without having to emit code and parse it back? (Just thinking.)

I like this thread, even when I’m on vacation. I’d like to see it keep going.

(Too bad you’ll have to redo it all for Fermi!)

Redoing things because they got faster is never bad ;)

Yes, you’re right. I will rework that footnote to be more accurate.

Oh, I am a fscking idiot. This is exactly what was going wrong with the original “magic speedups”: out-of-order execution was letting the entire kernel finish before a single memory operation completed for some configurations. I knew that OOOE was possible, either on-chip or at compile time, but I had no idea that it was capable of letting execution run so far ahead of a queued memory operation.

Now that I have a real explanation for this speedup, I’ll be investigating the characteristics of the OOOE with respect to deferring memory stalls. Unless someone has beaten me to it. I shall search first.

I thought you were saying “assembly” as in “another assembly language”, vs “output of ptxas”. I will look into using cudasm with pycuda.

Sylvain, kudos for saying “thanks for the insight” instead of “that’s not what I meant, you idiot”, and for the politeness in general. I appreciate it, as well as everyone’s patience and forgiveness while I develop my understanding of the hardware.

Steven

I don’t remember seeing a formal analysis of this mechanism. Some quick tests I did a while ago suggested out-of-order completion for global/local loads (and stores), but not for other instructions. NVIDIA also owns some IP on scoreboarding systems (cited in post #953140).

You get the same timings with red and atom instructions, so maybe I was just wrong about this. Thanks for adding this test.

What I would have expected is that red would have the same latency as a write (a fire-and-forget operation), and that atom would have the same latency as a read (the value returned by atom is the old value from memory, so the critical path is the same as a load’s). But your results show something else.

Trying to make some sense of the “all conflicts” plot: 16 warps/block × 32 threads/warp × 30 blocks means 15360 individual atomic transactions to the same address.

Assuming they all get serialized, this would mean each transaction takes 39 cycles, or 28ns. Not so bad!
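Spelling out the arithmetic (the ~1.4 GHz shader clock is my assumption, typical for GT200-class parts):

16 warps/block × 32 threads/warp × 30 blocks = 15360 transactions
39 cycles / 1.4 GHz ≈ 28 ns per transaction
15360 transactions × 28 ns ≈ 430 µs for the whole kernel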

If the serialization assumption is correct, that would mean that atomics are handled very close to the memory controller.

The Tesla (or Fermi) instruction set is definitely “another assembly language”, so your answer was perfectly sensible. ;)

It was only afterward that I thought: “hey, what’s the point of going to the effort of writing Python code that generates assembly code just so that it can be parsed back by another Python program?”

And to return you the flattery: I’m really interested in understanding the architecture of GPUs, so I welcome, appreciate and wish to encourage any attempt to extend this knowledge…

Remember, the chip’s flooded with IO for the benchmarks that are posted. As-yet-unposted benchmarks (I’m being cautious this time!) which add a little bit of compute in there show that the latency is indeed lower for red operations, as long as the memory controller and/or bus aren’t flooded and rejecting requests. (This comment is of course speculation, but I hope to back it up with more evidence soon.)

I’ve grabbed a few books from the library on bus and memory controller design, and I may order more recent ones from my school’s interlibrary loan or perhaps just straight from Amazon. Hopefully this will lead to a more complete analysis, along with the IP links you shared.

Decently major update posted (http://strobe.cc/articles/cuda_atomics/, if you are too lazy to scroll up). I reserve the right to change the horrible color schemes I used for all graphs posted.

Another update posted, covering the width of atomic locks.

Quite interesting - but as TMurray said: Fermi will change everything. ;)

My reaction to atomic performance on Fermi is somewhere along the lines of glee. Maybe maniacal glee, even.

We (as in those of us who can’t play with the new toys yet) have suspected that the atomic memory transactions in Fermi would be handled in the L2 cache rather than the memory controller, which would provide a huge performance improvement for such operations when all the atomically updated memory locations can fit in the cache. I will take your maniacal glee as tentative confirmation of this and add my own excitement. :)

You guys have (probably) just made histogramming WAAAY faster and (more importantly) easier to code up in CUDA. One common algorithm for us in high energy physics is to rebuild a histogram from a list of values over and over again, but each time applying a different global transformation to the values before binning. (Re-histogramming is usually part of some kind of likelihood or chi^2 optimization algorithm where we are trying to find the most probable value for some parameters in a model, including parameters which control these global transformations.) It sounds like this is the kind of problem that Fermi will eat for breakfast.
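To make that concrete, here is a minimal sketch of one re-binning pass (the linear transform and every name in it are hypothetical, just to show the atomic scatter):

__global__ void rebin(const float *values, int n,
                      unsigned int *bins, int n_bins,
                      float scale, float shift,  // this pass's global transformation
                      float lo, float hi)        // histogram range
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = scale * values[i] + shift;             // transform before binning
    int bin = (int)((x - lo) / (hi - lo) * n_bins);  // map the value to a bin index
    if (bin >= 0 && bin < n_bins)
        atomicAdd(&bins[bin], 1u);                   // atomic scatter into the histogram
}

On GT200 every one of those atomicAdds has to travel out near the memory controller; if Fermi really resolves them in L2, any histogram that fits in cache should get dramatically cheaper.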

My credit card is already preheated for the first GeForce card packing a Fermi chip. :)

Hmm… Are you sure there will be a low-cost GeForce version too?

The rumour mills have NVIDIA paper-launching the Fermi-based GeForce “GF100” sometime this quarter, with silicon on shelves by the summer. Probably. These next-generation cards are turning into something akin to Waiting for Godot, but they should see the light of day this year.

More like nerds talkin about sex: “It will be so great!”