Measurements of different CUDA operator throughputs

I spent Saturday making a quick and dirty tool to (coarsely) measure the throughput of different CUDA operations. We all know that sqrt() is slower than a multiply, but how much slower?

The programming guide does give guidelines to avoid slow operations, like “32-bit int multiplies are slower than floats. Use __mul24() if applicable for 4 times more throughput” and “don’t use integer divide.”

So how to measure the actual speeds? The simple answer is to make a giant unrolled loop and do a million divides or whatever and see how many clock ticks it takes.

That’s the basic test harness I whipped up this afternoon. I’m attaching it here, along with some results. The code is public domain, go ahead and abuse it any way you like.

It’s easy to measure any expression or even function… just add one line to define an integer test like

IDEF(01,  __imul24(a, b));

IDEF(02, myCleverHammingDistanceFunction(a,b));

then in the main() body just call the test with



Just compile with a simple

nvcc -arch=sm_13

Take a look at the results in the attached files.

I am not a microcode, op-level programming expert, and I’d love to have others look at some of the results and perhaps improve on my crude methodology.

(Sylvain, I am especially interested in your input!) So please take my following analysis with some reservations.

Lots of surprises in the results.


    EDIT followup August 23. This is the latest version with latest timings. Double support has been added as well.[attachment=13336:timings.txt]

The integer __mul24() addition fusion could be explained by the fact that this is the same good old multiplier as in the fp32 unit, running pretty much the way it was always supposed to.

The programming guide talks quite a bit about throughput in section
But it’s hard to match some of the concepts to the actual measurements.
For example, the guide says that logf() is the same cost as rsqrt(), but my measurements show them at 30 and 2 clocks, respectively.

The guide also explicitly says the compiler will optimize literal power of two divides like a/2, but that’s also not true by these measurements.

I have also done this sort of benchmark, based on code posted by tmurray to reach the maximum performance of fused multiply-adds. Unfortunately I cannot post the benchmark right now, but I can post the results:

Ah, that’s a great double check!

I also see that the __logf() intrinsic (which I did not test) is much faster than logf().

And you also mention the existence of integer mad24!

Is this from a paper you published? A thesis? I’d love to see more.

Yeah it is from my master’s thesis, but unfortunately I am not allowed to publish it.

The number of cycles after the slash for fmul and mad&mul in succession is the ‘resulting’ number of cycles due to dual issue.

With respect to your measured ‘3’ cycles: I measured GFLOP/s by counting flops in the code and measuring the runtime on the host, then assumed 4 cycles for an FMAD and calculated the rest relative to that.

a/2 is equivalent to a>>1 only if a is always positive or zero. So this optimization should only be applied to unsigned integers. However, since most people use signed integers even when the numbers are never negative, it’s generally good to write a>>1 instead of a/2 if you are sure a is never negative.

Nope. You have unsigned shift (filling with zero) as well as signed shift (filling with the leftmost bit.)

EDIT: Nico proved me wrong, see below

They’re still not fully equivalent though.

(-5)/2 = -2

(-5)>>1 = -3



There is something fishy about the fact that your double divisions always cost the same as a mul, even when the divisor is not a known constant. Methinks you are being outsmarted by an idiot savant (the optimizer… :) ), not least because NVIDIA marketing is not pushing this astonishing result loud and clear to the whole world.

I could be wrong, maybe they should?


Great initiative, and nice combined use of templates and string substitutions in macros. :)

I am still analyzing your tests. Here are my comments so far:

Just a naive question: is this related to the fact you run 192 threads (6 warps)? I can’t see where you take the number of threads into account in your calculations.

So the compiler uses the mad24 instruction whenever it can (this is also a PTX instruction).

Be careful. ptxas is very good at demoting 64-bit computations to 32-bit ones…

At the end of your test, you cast the 64-bit result to a 32-bit integer. So the compiler figures out that as long as you are doing only adds, muls and xors, it can just do all the computations on the 32 low-order bits since you will never use the high-order bits… Of course, the optimization fails when you use shifts, so the adds and xors of the ‘empty’ loop also suddenly get turned to 64-bit, which explains why the execution time increases so much.

(This optimization is particularly useful when you’re compiling on a 64-bit platform, so that all address calculations on 64-bit device pointers get demoted to 32 bits.)

The modulus is probably another optimization, I’ll look at it.

This is interesting. Most likely, the hardware overlaps the execution of the RCP and RSQ instructions in the second execution pipe with your xor and adds in the first pipe. It should also affect divides, to a lesser extent…

This would be a very target-specific optimization, and not IEEE-754-compliant (think (int)(14.f / 7.f)).

Although in the case of NVIDIA GPUs in single precision, it would not be worse than the already-non-compliant division by a variable…

Did you try sinf(1.e22f)? :)


Yeah, that’s a renormalization I planned to do, but it’s kind of a “magic multiply” and I was hoping to understand the reason.

Here’s a theory… the clock counter is measuring core clocks and not shader clocks.

Sylvain pointed out that I don’t divide by 192 for the thread count. That’s because I would get crazy clock counts if I did. I know my absolute scale needs to be tweaked, and it’s interesting to see that the tweak isn’t obvious.

All the measurements should be pretty decent to compare to each other relatively though.

mv, you also mention dual issue. That’s separate from mad fusion, isn’t it? It makes sense.

I did notice a speed boost when using a double precision accumulator in the float kernel (!!) but that could be explained by dual issue between the fp64 core and the 8 fp32 cores.

Does the CUDA hardware sport dual-issue for single SPs? CPUs certainly do with multiple integer pipes for example, and out-of-order and speculative computation, but I thought GPU SPs were deliberately dumb.

Or does a MAD really count as dual-issue? I thought it was a separate opcode.

Nope, it’s shader clocks (unless I have been lied to by everyone).

MAD is a separate opcode and does not count as dual issue. You can dual issue MAD+MUL, for example.

Isn’t it?

For instance, you’re getting 24 cycles/iteration for the addition.

Dividing by the number of warps (6) gives you 4 cycles/warp, just as the (older) manual tells you.

Alternatively, you can divide by the number of threads (192) and multiply with the number of SP (8). Then you get 1 cycle/operation/SP.

(Just to confuse you further, the clock counter runs at the frequency of the register file and instruction decoder (shader/2), but it’s multiplied by two using compiler magic to avoid having to lie to Tim :).)

The “dual issue” capability is a kind of a misnomer (this was originally a feature of the NV40 completely unrelated to the G80…)

It’s just that the SM front-end decodes instructions at a peak rate of 1/cycle, and dispatches them to the execution pipelines.

They are the MAD pipeline with 8 units, the SP/MUL pipeline with 8 multipliers and 2 special function evaluation units, and the double-precision FMA pipeline in the GT200.

Since the front-end is faster than either execution pipeline, it can dispatch instructions from different threads to each execution pipeline in turn, keeping them both (or all three) busy.

So the “dual issue” feature can be considered as a side-effect of having a front-end decoupled from multiple execution units.

Interesting theory… it’d take some careful analysis by the compiler to make sure the entire chain of computes was unaffected by the high bits. But I understand why that would be useful to check for, and maybe even common, especially if someone needlessly used a long long as a loop variable or something.

But to test the theory, all we have to do is save some of the high bits at the end of the compute so the compiler can’t ignore them even in theory. I changed the final accumulator save to use accumulator>>10 and reran the tests.

And there is indeed something to this idea! The results are mysterious… Compilation went from 30 seconds to 25 minutes.

The a>>20 shift case speeds up! It’s now 3 clocks, as is ^ xor. All other ops, including + * / % are a uniform 6 clocks… so even cheaper than before! So this shows that the “long-long can really be int” optimization (if that’s what it is) is actually failing. Or my analysis is wrong, which is more likely. Still, Sylvain, your instincts were correct, there’s definitely an interesting compiler mode switch happening there when only the low order bits are accessed.

And now we have the big question, if these measurements are correct, why doesn’t the compiler cast all integer divide and modulus values to 64 bits, do the op, then cast back? Seems like it should be a speed win. More tests are needed…

Even with IEEE conformance, is it explicitly mentioned that a/b != a*(1.0/b) in general?

You’re right though that even if it is true, the compiler could still optimize it for CUDA, since it knows that divides aren’t guaranteed to be 0 ULP anyway.

I just tried with a/4.0f and a/4 which are both explicitly optimizable with 0 error. The results… those are indeed fast, 3 clocks.

So the compiler is optimizing only those exact cases.

Sylvain, again you have great explanations for the odd behavior. Thanks!

Yeah, the sinf() calls are full range, since the arguments get incremented to wild numbers of all magnitudes. And that’s still the (fast!) speed.

It would be interesting to make a test where the arguments are designed to always be < pi/4 or something to see if there’s nice special cases for reasonable arguments.

Excellent point! And in fact testing shows that ((unsigned int)a)/2 is indeed much faster than a/2. Another mystery solved.

This feedback is great, it’s exactly the way to learn all those little gritty details!

Got it. I had that initially, but I think I discarded it when I was getting crazy results… but those ended up being due to the mad24 0 cost computes. My first two tests were __mul24() versus a*b… an atypical test case, it turns out.

I’ll post updated source and timing lists.

Unfortunately compiles are now super-painful (40 minutes) with the new “record a>>10, not a” trick for 64 bit longs. I’ll try to make a standalone test case… either for people to figure out what’s going on, or for a bug report if it looks OK.

Isn’t it because your device may be of low compute capability?

OK, did a specific test.

This is using the new clock scaling, where a+b is one clock. These numbers are all directly comparable.

For 32 bit a and b,

38.0 a/b
13.3 a/1837371
19.9 a%1837371
38.0 13234567/b
7.0 (int)(((long long)a)/((long long)b))
13.3 (int)(((long long)a)/1834761)
6.6 (int)(13234567/((long long)b))

From this, it appears casting to a 64 bit int for divides and modulus ops is a huge win.

And even the (int)(((long long)a)/1834761) case may be incorrectly too fast… here it may be that the compiler is trying to optimize by switching back to ints (which is why the speed matches the int speed), even though long long division would likely be 7 clocks, not 13.3.

Nope, both my GTX 280 and Quadro FX 4800 are compute 1.3 devices. And yes, I’m compiling with -arch=sm_13.

I haven’t dived back to figure it out more yet.

An NVIDIA employee once told me that there is no such thing as dual issue between double & single precision, as the double precision core uses some circuits of the single precision cores, so there should be another explanation for that.