CUDA Double Precision Performance: 933 GFlops vs 78 GFlops

AMD’s latest FireStream 9270 claims 240 GFlops of double precision performance.
[url=“http://ati.amd.com/technology/streamcomputing/product_firestream_9270.html”]http://ati.amd.com/technology/streamcomputing/product_firestream_9270.html[/url]

Even the CELL processor, which offers 460 GFlops of single precision performance, delivers
close to 200 GFlops of double precision (almost 1/2).

TESLA C1060 claims 78 GFlops of double precision performance and 933 GFlops of single precision performance.

Why such an abysmal value for double precision performance??

Is this all just a marketing gimmick? Does anyone have any numbers on single and double precision performance of TESLA for the same algorithm???

Thank you
Best Regards,
Sarnath

Ahh… I got the answer from an earlier thread:
[url=“http://forums.nvidia.com/lofiversion/index.php?t75452.html”]http://forums.nvidia.com/lofiversion/index.php?t75452.html[/url]

It is 8 SP units per MP to 1 DP unit per MP. That makes the difference.
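Working those peak numbers out (assuming the C1060’s 30 multiprocessors and 1.296 GHz shader clock, and counting the dual-issued SP MUL alongside the MAD):

SP: 30 MPs × 8 SP units × 3 flops/clock (MAD + MUL) × 1.296 GHz ≈ 933 GFlops
DP: 30 MPs × 1 DP unit × 2 flops/clock (MAD) × 1.296 GHz ≈ 78 GFlops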

Well, I see I have participated in that thread actively…

Certain things never get registered in my brain… Sigh…

Anyway,
How do I hide this latency? Say a WARP is executing a double precision multiply… all 32 threads issue this instruction, and there is just 1 DP unit to do the job… I presume they are processed one by one and would stall the pipeline…

Any idea how to intelligently hide this latency?

Are there any plans from NVIDIA to at least provide 3 DP units in future hardware? (Nonsense question??? I am no hardware guy)

Thanks

Well, you have the same latency hiding as with single precision. Only now you can load much more data before you get calculation-bound ;)
ATI is not doing real DP as far as I remember; they use 4 SP instructions to get a DP result. GT200 has a real IEEE-compliant DP unit (including full-speed denormal handling).
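To make that concrete, here is a minimal sketch (a generic grid-stride DAXPY, not anything from this thread): DP latency is hidden exactly like SP latency, by keeping enough warps resident per MP so the scheduler always has another warp ready to issue while one waits on memory or on the DP unit.

[code]
// Minimal sketch: a grid-stride DP kernel. The single DP unit per MP is
// kept busy the same way the SP units are: the warp scheduler interleaves
// instructions from all resident warps, so enough blocks/threads per MP
// hide both memory latency and DP execution latency.
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
    {
        y[i] = a * x[i] + y[i];   // one DP multiply-add per element
    }
}

// Example launch: plenty of blocks so every MP has warps to swap between.
//   daxpy<<<120, 256>>>(n, 2.0, d_x, d_y);
[/code]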

Also, I’d think that in many (if not nearly all) algorithms, you could make a first run in SP, then use your ‘best’ result from that as a starting point for the DP code, which would only be needed to “refine” the solution to a few more decimal places of accuracy.

Where do you get that information from? AFAIK the “normal” first-generation CELL with 1 PPE and 8 SPEs at 3.2 GHz delivers around 230 GFlops in single precision and drops seriously in DP to around 14 GFlops due to pipeline stalls.

The improved version, called Cell2+/PowerXCell 8i (which I don’t know much about), is “only” quoted to have around 100 GFlops in DP.

Edit: Added “afaik” :)

So it is on the Wikipedia page, but that page points to a worknote by Jack Dongarra and Alfredo Buttari on the iterative refinement method, applied to mathematical examples, which obtains “double precision” results using “mixed precision” in practice (i.e. using single precision to do the big work… and then double precision to refine it, as profquail said…). The DP peak they cite is 14 GFlops too…
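For anyone who hasn’t read the worknote, a hedged host-side sketch of the refinement loop (solve_sp() here is a hypothetical placeholder for whatever single-precision solver you have, e.g. an SP LU factorization on the GPU):

[code]
#include <vector>

// Hypothetical SP solver: solves A32 * d32 = r32. All the heavy O(n^3)
// work lives here; the refinement loop around it only does O(n^2) work.
void solve_sp(int n, const std::vector<float>& A32,
              const std::vector<float>& r32, std::vector<float>& d32);

// Mixed-precision iterative refinement for A*x = b, in the spirit of the
// Dongarra/Buttari worknote: start from x = 0 (so the first pass just
// yields the SP solution), then repeatedly compute the residual in DP,
// solve for a correction in SP, and apply the correction in DP.
void refine(int n, const std::vector<double>& A,
            const std::vector<double>& b, std::vector<double>& x, int iters)
{
    std::vector<float> A32(A.begin(), A.end());   // demote A once
    std::vector<float> r32(n), d32(n);
    x.assign(n, 0.0);

    for (int it = 0; it < iters; ++it) {
        // DP residual: r = b - A*x (this is the step that must be DP)
        for (int i = 0; i < n; ++i) {
            double s = b[i];
            for (int j = 0; j < n; ++j)
                s -= A[i * n + j] * x[j];
            r32[i] = (float)s;
        }

        solve_sp(n, A32, r32, d32);   // SP correction solve (the fast part)

        for (int i = 0; i < n; ++i)   // DP update of the solution
            x[i] += (double)d32[i];
    }
}
[/code]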

Do you think such an approach would be possible when I need to compute a complex formula with a bunch of multiplications, additions, divisions, and some trigonometric functions?

We have a CELL blade with us that has 2 PPEs and 16 SPEs… so that is 460 GFlops in single precision… and 200 in double precision (Wikipedia shows 100 GFlops only for 8 SPEs; the QS22 has 16 SPEs and 2 PPEs).

Anyway, check this IBM PDF out: ftp://ftp.software.ibm.com/common/ssi/pm/…LD03019USEN.PDF

This is from the QS22 home page (click on the Data sheets link to load the same PDF):

http://www-03.ibm.com/systems/bladecenter/…qs22/index.html

EDIT: In the PDF, just search for GFLOPS

Probably, but it depends on your algorithm. If the algorithm you’re using is numerically stable (i.e. small perturbations in x produce small perturbations in f(x)), then yes, it should work out for you. A counter-example is the Mandelbrot set, where this approach doesn’t work because the output values can jump dramatically even for the smallest change possible (machine epsilon). The only other thing I can think of is that some Monte Carlo-type methods may not work if you’re trying to be ultra-precise either (since there’s no way to use single precision results to seed double precision computations).
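A quick hedged illustration of the Mandelbrot point (the coordinates below are just an arbitrary point picked near the set boundary): because float cannot even represent the input point exactly, the SP and DP escape counts can disagree outright, leaving nothing for a DP pass to “refine”.

[code]
#include <cstdio>

// Escape-time iteration count at (cx, cy), templated so the identical
// code runs in float or double.
template <typename T>
int escape(T cx, T cy, int max_iter)
{
    T x = 0, y = 0;
    int i = 0;
    while (i < max_iter && x * x + y * y < T(4)) {
        T xt = x * x - y * y + cx;
        y = T(2) * x * y + cy;
        x = xt;
        ++i;
    }
    return i;
}

int main()
{
    // An arbitrary point near the boundary; it needs ~16 digits, so the
    // float version effectively iterates a slightly different point.
    double cx = -0.7436438870371587, cy = 0.1318259042053119;
    printf("float : %d\n", escape<float>((float)cx, (float)cy, 100000));
    printf("double: %d\n", escape<double>(cx, cy, 100000));
    return 0;
}
[/code]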

For most other things, though, I think it will work; if your code isn’t really, really complex, it probably wouldn’t take you too long to try it out and see what the results are like (and how much speedup, if any, you got from it).

We have implemented our own highly tweaked version of N-body for both double and single precision on a Tesla C1060.

Double GFlops = ~54 out of a max of 78…

Single = ~500 out of a max of 933

The paper we have worked on also compares other multi-core platforms, one of them being the IBM QS22 blade (220 GFlops double peak!!), but it is very hard to achieve that; it may not even be possible to achieve 1/10th of that peak. This is because CELL lacks a hardware implementation of inverse square roots and divides, and the N-body kernel has both of them, so it killed the DP performance for the IBM CELL…

BUT NVIDIA should try more seriously to improve DP performance… SO THAT IT’S NOT 8-10 TIMES SLOWER!!

Thanks, hope this helps…

Having two CELLs in your Blade certainly helps me put the figures together :) The way you wrote it just made me think that a single CELL should have that kind of performance.

It’s almost trivial for them to boost DP performance… they just need to stick another DP ALU in each SM. Or 3 more. Or even change all 8 SP ALUs into DP.

But those all cost transistors. The question is where the best TRADEOFF is… what balance of DP and SP is most efficient over the whole range of GPU applications. And balance register and shared memory size in there as well.

I’ve found that the current balance is pretty nice… I would be upset if another DP were added at the expense of losing some SMs for example.

I use DP in raytracing, but I found the lower throughput DP is fine for doing the high precision “setup” computes and then the SP ALUs cook from there.

Considering that ATI doesn’t have native DP at all, you can’t complain much about NVidia’s full implementation.

Thanks for your answer. Could you please give me a simple example? Say, I want to compute atan((xy+xz)/(x+z)), using single precision estimate + double precision correction. How should I tackle this?

I am already familiar with this approach used for solving of linear systems (http://www.netlib.org/lapack/lawnspdf/lawn175.pdf), but I do not see how I would use it to compute the formula above.

Sorry, I guess I was a bit unclear about what works and what doesn’t. For something like a trig function, there’s really no reason to compute in SP and then correct to DP. However, if you were (to use your example) using x, y, and z as parameters to a larger kernel that attempted to minimize some nonlinear system of equations, and part of the output depended on the result of the atan() function, you would run the entire kernel in single precision in an attempt to find good ‘guesses’ for the minimum points, then use those ‘guesses’ as starting points for a double precision kernel to get further accuracy. In this sort of general case, 99% of the work could be done in single precision, and once a good “guess” is found, a DP kernel would only be needed to get a few more decimal places of accuracy.
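One way to structure that on the GPU is sketched below (hedged; objective() just reuses the atan() formula from above as a stand-in for the real system being minimized): write the kernel once as a template, sweep the parameter space coarsely in float, then re-evaluate in double only around the best float candidates.

[code]
// Hedged sketch: one kernel template instantiated for float and double.
// objective() is a placeholder; CUDA's overloaded atan() resolves to the
// SP or DP version depending on T.
template <typename T>
__device__ T objective(T x, T y, T z)
{
    return atan((x * y + x * z) / (x + z));   // stand-in formula
}

template <typename T>
__global__ void evaluate(const T *xs, const T *ys, const T *zs,
                         T *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = objective(xs[i], ys[i], zs[i]);
}

// Host-side outline:
//   1. evaluate<float> <<<grid, block>>> over the whole coarse grid (SP, fast)
//   2. pick the best candidate points on the host
//   3. evaluate<double><<<grid, block>>> on a fine grid around them (DP, few points)
[/code]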

You have the right idea with the linear systems approach; that is what I do a lot of, and so most of the code I’ve seen with “mixed precision” has to do with them as well (though, to repeat myself, it should be adaptable to many other problems).

Ok Sarnath,
www-03.ibm.com/technology/resources/technology_cell_pdf_PowerXCell_PB_7May2008_pub.pdf,
here are the real specifications for a single IBM PowerXCell 8i:
100 GFlops in DP, 200 GFlops in SP.
NVIDIA’s 78 GFlops is less, but not too far from 100 GFlops;
but using mixed precision methods I can gain much more performance (proportionally) using a GeForce rather than a Cell.
With the PowerXCell you can reach 100 GFlops using pure DP, but less than 200 GFlops using mixed precision;
with the G200 you can reach 78 GFlops using pure DP, but you have up to 930 GFlops as an upper bound for mixed precision
(even if in practice the SP upper bound is around 400 GFlops, that’s still more than 200).

Sarnath, if you’ve got a CELL blade, you may want to search the forums… there was a student (from Georgia Tech, I believe) that wrote some kind of PTX driver/runtime for the CELL (so that you could write a CUDA program, keep the PTX output of the compiler, and then have a program that would run on either a CUDA card or a CELL processor).

@Fuql,
Sure. Thanks. We are on same page now.

@Nitin.Life,
Thanks for publishing some results. I have the same concern as you. Running @ 1/10th the speed of SP is a big turn-off.

@SPWorley,
People often keep talking about the nanometer thing in chip manufacturing… the smaller the process, the more transistors… I think NVIDIA still uses a bigger process node… If they can come down to 45 nm or smaller (Intel is looking @ 10 nm, as per Pat’s interview to Forbes), they can stuff all this in… I am a complete noob to hardware… it’s possible that what I am writing is nonsense…

@Samuel, @profquail,
Yeah, mixing precision is definitely a good idea to boost performance. Thanks! I will try to keep this in mind when I move on to DP.

@profquail,
Good to know about the PTX runtime for CELL… Just amazing, the kinda work people do. Thanks!

Nope, you’re completely correct, they will likely have more transistors to “spend” in future devices.

You’re right that those new transistors can be used for DP.

But my point was that it’s a tradeoff… if you have more transistors, do you spend them entirely for DP? Or a mixture of SP and DP, and if so, in what ratio? I think the current 8:1 ratio is a good sweet spot for most computations. 16:1 might be even better for mainstream graphics though, and 0:1 might be nice for some supercomputer applications.

For my main raytracing algorithms, the current register/shared/SP/DP ratio is surprisingly nearly perfect.