Measurements of different CUDA operator throughputs

Did you also initialize the high-order bits of arg1 and arg2 before the loop starts?

Here is what I get (normalized by add=1 clock, CUDA 2.3, 9800GX2):

L01  1.0 a^12345
L02  1.0 a>>20
L03  2.0 a+1
L07  2.0 a*0x12345678753121LL
L15  2.0 a^0x123456712345671LL
L16 19.9 a*b
L17 19.9 a>>(int)b

The first xor (by a small constant) only operates on the lower 32 bits, so 1 clock.

Additions need an add instruction on the lower bits, then an add-with-carry on the higher bits, so 2 clocks.
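Written out in plain C, that lowering looks something like this (an illustrative sketch, not the actual PTX):

    // 64-bit add built from 32-bit halves: add the low words, then
    // add the high words together with the carry out of the low add.
    unsigned long long add64(unsigned int a_lo, unsigned int a_hi,
                             unsigned int b_lo, unsigned int b_hi)
    {
        unsigned int lo    = a_lo + b_lo;          // add
        unsigned int carry = (lo < a_lo);          // carry out of the low word
        unsigned int hi    = a_hi + b_hi + carry;  // add-with-carry
        return ((unsigned long long)hi << 32) | lo;
    }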

64-bit xors can be split into two independent 32-bit xors.

The right shift by a constant is more complex, but can be explained: the benchmark xors together the results of all the shifts, and since (x >> k) ^ (y >> k) == (x ^ y) >> k, the compiler can simply xor the unshifted values together and perform the shift only once, at the end of the loop.

So it’s actually surprising the result is not 0.
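To see the algebra concretely (assuming, as described above, that the benchmark xors the shift results into an accumulator), here is an illustrative pair of functions; both return the same value, but the second performs only a single shift:

    unsigned long long shifted_xor_naive(const unsigned long long *v, int n)
    {
        unsigned long long acc = 0;
        for (int i = 0; i < n; i++)
            acc ^= v[i] >> 20;    // one shift per iteration
        return acc;
    }

    unsigned long long shifted_xor_hoisted(const unsigned long long *v, int n)
    {
        unsigned long long acc = 0;
        for (int i = 0; i < n; i++)
            acc ^= v[i];          // xor only
        return acc >> 20;         // single shift at the end
    }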

It also seems that multiplications by constants can be really efficient: looking at the generated code, the compiler uses mad-with-carry extensively.

Didn’t try with other numbers, such as numbers specially crafted to trigger a carry-propagation chain.

Yes. The standard defines exactly what a / b is: basically it’s the floating-point value closest to the exact result of the division.

If you multiply by the reciprocal, you will suffer from an extra intermediate rounding.

An easy example is 14.0/7.0. The answer is trivially 2.0, exactly representable in FP, and what IEEE-754 requires the division to return.

But 1.0/7.0 is not representable in FP, so it’s rounded to a slightly smaller value.

Then when you multiply back by 14.0, the exact product is slightly lower than 2.0, and after the final rounding you are not guaranteed to get 2.0 back. Oops…
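Whether a given pair actually ends up misrounded depends on that final rounding too, so here is a quick host-side sketch (not part of the benchmark) that counts how often the two recipes disagree; it assumes IEEE-754 single precision, with volatile forcing each intermediate to round to float:

    #include <stdio.h>

    int main(void)
    {
        int mismatches = 0, total = 0;
        for (int a = 1; a <= 1000; a++) {
            for (int b = 1; b <= 100; b++) {
                volatile float fa = (float)a, fb = (float)b;
                volatile float q  = fa / fb;     // one rounding
                volatile float r  = 1.0f / fb;   // first rounding
                volatile float qr = fa * r;      // second rounding
                if (q != qr) mismatches++;
                total++;
            }
        }
        printf("%d of %d quotients differ\n", mismatches, total);
        return 0;
    }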

Good to know, thanks.

Are you sure it doesn’t get too wild, namely, +infinity?

Roughly speaking, each iteration doubles the values of arg1 and arg2 by adding them together. So after just around 110 iterations you are probably overflowing your float variables.
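To put a number on it, a tiny host-side sketch (assuming IEEE single precision, whose largest finite value is just under 2^128):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        float x = 1.0f;   // the benchmark's seeds are presumably larger,
        int n = 0;        // so it overflows in even fewer iterations
        while (!isinf(x)) {
            x += x;       // each pass doubles x, as in the benchmark loop
            n++;
        }
        printf("+inf after %d doublings\n", n);   // prints 128
        return 0;
    }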

So I strongly suspect that what you are testing is the code:

if( isinf(x) ) return NANF;

at the beginning of the sinf function… ;)

Can’t reproduce it here. I get instead:

I52 33.5 a/b
I53 75.1 (int)(((long long)a)/((long long)b))

(same setup as above)

I’m not sure who told you this, but it’s not true: our double-precision units are separate. ATI apparently does implement their double precision using the SP units, so maybe they were confused!

Both of you may be right…

According to David Kanter’s article, the double-precision unit shares its instruction dispatch port with the single-precision unit:
Real World Tech: “NVIDIA’s GT200: Inside a Parallel Processor”

This is plausible and sounds like a reasonable architectural tradeoff.

Could make use of Steve’s test harness to check this hypothesis…

Hmm, do the 64 bit integer ops as measured depend on compute capability 1.3?

What happens if I try 64 bit integer arithmetic on a compute 1.1 device? Will the compiler produce less efficient 32 bit code?

Good question.

I compared the code generated for various compute capabilities. There is indeed a difference in register scheduling and mov operations, but it’s only between CC 1.1 and CC 1.2, so it’s not related to the double-precision unit.

So assuming there is support for 32-bit multiplies and 64-bit moves in the double-precision unit, it is not currently used by the compiler. (Actually it would be slower than emulating these operations in the single-precision units, given that muls are 4 ops and moves are 2 ops.)

You can’t use the DP unit and SP units at the same time.

I was not claiming that it wasn’t a separate unit, just that some circuitry is shared, preventing double-precision operations from being overlapped with single-precision operations (which I was hoping would ‘hide the double-precision latency’). It was mfatica who told me (I just saw his confirmation above).

I got a question regarding the division of floats.

As discussed already there is this at first sight weird behavior that a simple reciprocal is way faster than a real divide:
F03 3.2 a*b
F07 2.2 1.0f/a
F08 18.9 a/b

Now you obviously could do something like a*(1.0f/b) instead of a/b, but as explained earlier in this thread this is not standard-conformant, since you get an additional rounding operation in between (as an example, 14/7 was given, which would equal 2 if done as a/b and come out slightly smaller than 2 if one used a*(1/b)).

But is there a compiler switch or something to enable this speed optimization anyway, for kernels where I don’t care whether the result is exact? For example, in Molecular Dynamics or Monte Carlo I have a chaotic system anyway, so a little more “random” influence does not matter to me.

So there are actually 3 versions of the single-precision division in CUDA, not 2.

- __fdividef is just a*(1/b): not compliant, but fast. Used by default when the -use_fast_math flag is set.

- __fdiv_rn is the IEEE-754-compliant division. Implemented in software since CUDA 2.2 with a Newton-Raphson iteration using fixed-point arithmetic and a lot of control logic. Very slow. (Need to add it to Steve’s bench to know how much “very” is.)

- a / b when -use_fast_math is not set is both non-compliant and slower than __fdividef. It is intended as a workaround for the lack of denormals in single precision. It is implemented as something like:

    if (exponent(b) > 126) {
        // Taking the reciprocal of b would return a denormal,
        // which would then be flushed to zero,
        // causing the division to return infinity.
        a *= 0.5f;  // So shift everything toward zero
        b *= 0.5f;
    }
    return a * (1.0f / b);
By default the third version is used. If you specify -use_fast_math, the first one will be used.

Unfortunately, there is currently no -use_slow_math nor -i_really_insist_that_i_want_my_arithmetic_to_be_correct_not_fast flag, so users who want correct rounding need to call __fdiv_rn (and __fadd_rn and __fmul_rn) directly.
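For completeness, here is a minimal kernel sketch showing how to pick a flavor explicitly today; __fdividef and __fdiv_rn are the real intrinsics, while the kernel around them is made up for illustration:

    __global__ void divide(float *out, const float *a, const float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = __fdividef(a[i], b[i]);    // fast, non-compliant
            // out[i] = __fdiv_rn(a[i], b[i]);  // IEEE-754 compliant, very slow
            // out[i] = a[i] / b[i];            // default: the fixup path above
        }
    }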

I finally went back to this project to get it cleaned up, with proper clock calibration and double operator measurements.
The updated code (and a list of several hundred operator throughputs!) is attached to the first post in this thread.

Some of the bizarre timings I had found and worried about previously were simply explained by clock() wraparound! It can hold only 2^32 ticks, and at fast shader rates, that’s only about 3 seconds before it wraps. So a very slow operation might wrap and suddenly alias to a much faster speed than the real timing. This was definitely the case with my puzzlement over 64 bit divide speeds. (I was excited they were so fast. Umm… no, unfortunately they were so slow they wrapped.)
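As a sketch of the failure mode (illustrative device code, not the actual harness), the usual unsigned-subtraction idiom is only correct modulo 2^32:

    __global__ void time_op(unsigned int *elapsed)
    {
        unsigned int start = (unsigned int)clock();
        // ... operation under test, repeated many times ...
        unsigned int stop = (unsigned int)clock();
        // Unsigned subtraction survives a single counter wrap, but once the
        // measured section itself exceeds 2^32 ticks, the result silently
        // aliases to (true_elapsed mod 2^32) and looks far faster than it is.
        *elapsed = stop - start;
    }

(Where available, the 64-bit clock64() counter sidesteps the wrap entirely.)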

And even now there’s some negative timings for some double operations… those likely are wrapping now too. (I didn’t catch that before I uploaded)

Some new puzzles appeared with the full set of timings. In particular, 64 bit constant multiplies seem broken… it takes 8 times as long to multiply a variable by a constant as it takes to multiply two variables!

L05 141.6 a*12345
L06 192.5 a*0x12345678753121LL
L07  23.9 a*b

Perhaps related, there’s a compiler issue with those constant multiply cases. The timing program itself takes about 30 seconds to compile when about 100 tests are active. But if you include the single a*12345 test, compilation time jumps to 9 minutes. No other operation for any type affected compile time.

OK, uploaded a new timings.txt that doesn’t have rollover for the doubles, so those measurements are now OK as well.

I’m still bothered by the 64 bit integer multiply speeds. I’ll look into it some more.

I isolated a good test case for the 64 bit multiply issue and reported it to NV as a bug. Compilation with a kernel using a*b is about 2 seconds. Compilation with 12345*a is about 40 minutes.

It may be a compiler that’s trying to be too clever. The runtime performance of a*b is also about 6 times faster than the runtime performance of 12345*a, so that may be related.
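Schematically, the test case boils down to a kernel pair like this (an illustrative reconstruction, not the exact code filed with NV):

    // Variable multiply: compiles in seconds and runs fast.
    __global__ void mul_var(long long *out, const long long *a, const long long *b)
    {
        int i = threadIdx.x;
        out[i] = a[i] * b[i];
    }

    // Constant multiply: the pathological case, both at compile time and at runtime.
    __global__ void mul_const(long long *out, const long long *a)
    {
        int i = threadIdx.x;
        out[i] = a[i] * 12345LL;
    }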

Thanks Mr. Worley. This thread may be one of the most useful CUDA threads ever.

Christian