So far the double precision support has worked as advertised, but the description in the Programming Guide is missing some details. Some additions I would like to see:
Quantification of the performance tradeoff for double vs. single precision arithmetic. The guide states how many clock cycles single precision operations require, but no guidance is given for double precision. Is it twice as slow? More? Is the fused multiply-add for doubles a single hardware instruction, as it is for singles?
Is there intermediate truncation in the fused multiply-add operation for doubles, as there is for singles? This affects implementing extended precision arithmetic by combining doubles (see the TwoProd sketch after these questions). (Not an immediate need of mine, but I bumped into this same issue when faking double precision using singles in CUDA.)
Are there no intrinsic double precision functions (equivalent to __sinf(x), etc.)? I assume not, since rummaging through the headers shows implementations of cos(x), sin(x), etc. that use argument reduction plus a polynomial approximation over the reduced interval. But I figured I would check. :)
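To illustrate the extended-precision point: the usual building block is an error-free product, which only works if the FMA rounds exactly once. A minimal sketch, assuming fma() is truly fused (the name two_prod is mine, not from the guide):

    // Error-free transformation of a product (Dekker/Ogita-style TwoProd).
    // Assumes fma() performs a single rounding; if the intermediate
    // product is truncated, err is no longer the exact rounding error.
    __device__ void two_prod(double a, double b, double *prod, double *err)
    {
        *prod = a * b;
        *err  = fma(a, b, -(*prod));  // residual of the rounded product
    }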
Also, one more suggestion: I foresee some of our code having both native double and pseudo-double (pair of singles) versions of kernels, to support both >= sm13 and < sm13 architectures. In these kinds of kernels, I would rather not have automatic double->float demotion for pre-sm13 targets. Instead, it would be better for the compiler to emit an error or warning. Can this be added as a flag to nvcc?
I’m imagining something like a -Wdouble-convert flag, which would report any time a double precision variable, constant, or function is used when the target arch does not support it.
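For reference, the pseudo-double addition I have in mind is the standard pair-of-singles trick (DSFUN90-style); a minimal sketch, with names of my own choosing:

    // A double is approximated by a float2: .x is the high part, .y the
    // low part. Relies on the compiler not reassociating float math
    // (nvcc's default behavior).
    __device__ float2 ds_add(float2 a, float2 b)
    {
        float t1 = a.x + b.x;                       // high-order sum
        float e  = t1 - a.x;
        float t2 = ((b.x - e) + (a.x - (t1 - e)))   // exact error of t1
                 + a.y + b.y;                       // plus the low parts
        float hi = t1 + t2;                         // renormalize
        return make_float2(hi, t2 - (hi - t1));
    }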
I’d like something related: a compiler flag that would automatically demote doubles to floats. Variables and even the math functions are easy to handle; it’s the floating point literals in the code (e.g. 2.50) that cause me problems. My code has so many floating point literals that the compiler with -arch=sm_11 ran out of double precision registers. I had to manually change every 2.50 to 2.50f to get it to compile. I’d like a compiler switch that demotes floating point literals from double to float without ever representing them as doubles internally.
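To make the literal problem concrete (a toy example of my own):

    // Unsuffixed literals have type double in C, so this multiply is done
    // in double (or demoted, with a warning, on pre-sm13 targets):
    __device__ float scale_slow(float x) { return x * 2.50; }

    // The f suffix keeps the whole expression in single precision:
    __device__ float scale_fast(float x) { return x * 2.50f; }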
I think there will still be occasions to compile with floats instead of doubles for performance, even with the GT200.
The performance of double precision is expected to range between 80 and 100 Gigaflops peak at production clock rates for the GT200 GPU. We will provide more specific information in the near future.
DP FMA is a fused multiply-add as specified in IEEE 754R (I thought this was pointed out in the Programming Guide?).
The only current DP HW support is for mul, add, fma, and the FP<->INT and FP<->FP conversions. For DP sin, cos, etc., use sin(), cos(), etc.
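For example (a toy device function; the name is just illustrative):

    // fma() compiles to the DP hardware fused multiply-add instruction;
    // sin() is the software implementation from the headers
    // (argument reduction + polynomial approximation).
    __device__ double axpy_sin(double a, double x, double y)
    {
        return sin(fma(a, x, y));
    }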
This is not a bad idea. I can file a bug to request this in a future release.
Do any CPU compilers do this? I don’t see a big need for it. Just as on a CPU, on the GPU you should write float-safe code: use float literals where you only need float precision, and double literals only where you need double precision.
I don’t see why literals should require registers at all. The compiler puts them in constant memory, not registers. If you have an example where it does not, please file a bug against the compiler using your registered CUDA developer account.
Ah, OK, a more careful reading of section A.2 does confirm this. The first sentence is “All compute devices follow the IEEE-754 standard for binary floating-point arithmetic with the following deviations:”, and intermediate truncation of the FMA is mentioned for single precision but not for double precision. I’m so used to thinking of the single precision case as the only case that I mentally promoted the truncation warning up a level in scope. :)
That said, Sections B.1.1 and B.1.2 say that both fmaf() and fma() have 0 ulp of error. Does this mean the IEEE-754R standard allows intermediate truncation in a fused multiply-add operation? (I’m not familiar with what the standard requires here.)