So far the double precision support has worked as advertised, but the description in the Programming Guide is missing some details. Some additions I would like to see:
Quantification of the performance tradeoff for double vs. single precision arithmetic. The guide states how many clock cycles single precision operations require, but no guidance is given for double precision. Is it twice as slow? More? Is the fused multiply-add for doubles a single hardware instruction, as it is for singles?
Is there intermediate truncation in the fused multiply-add operation for doubles, as there is for singles? This affects implementing extended precision arithmetic by combining doubles (see the TwoProd sketch after these questions). (Not an immediate need of mine, but I bumped into this same issue when faking double precision using singles in CUDA.)
Are there no intrinsic double precision functions (equivalent to __sinf(x), etc.)? I assume not, since rummaging through the headers shows implementations of cos(x), sin(x), etc. that use argument reduction plus a polynomial approximation over the reduced interval. But I figured I would check. :)
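To illustrate the extended-precision point: the usual building block is an error-free product, which only works if the FMA rounds exactly once. A minimal sketch, assuming fma() is truly fused (the name two_prod is mine, not from the guide):

    // Error-free transformation of a product (Dekker/Ogita-style TwoProd).
    // Assumes fma() performs a single rounding; if the intermediate
    // product is truncated, err is no longer the exact rounding error.
    __device__ void two_prod(double a, double b, double *prod, double *err)
    {
        *prod = a * b;
        *err  = fma(a, b, -(*prod));  // residual of the rounded product
    }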
Also, one more suggestion: I foresee some of our code having both native double and pseudo-double (pair of singles) versions of kernels, to support both >= sm13 and < sm13 architectures. In these kinds of kernels, I would rather not have automatic double->float demotion for pre-sm13 targets. Instead, it would be better for the compiler to emit an error or warning. Can this be added as a flag to nvcc?
I’m imagining something like a -Wdouble-convert flag, which would report any time a double precision variable, constant, or function is used when the target arch does not support it.
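For reference, the pseudo-double addition I have in mind is the standard pair-of-singles trick (DSFUN90-style); a minimal sketch, with names of my own choosing:

    // A double is approximated by a float2: .x is the high part, .y the
    // low part. Relies on the compiler not reassociating float math
    // (nvcc's default behavior).
    __device__ float2 ds_add(float2 a, float2 b)
    {
        float t1 = a.x + b.x;                       // high-order sum
        float e  = t1 - a.x;
        float t2 = ((b.x - e) + (a.x - (t1 - e)))   // exact error of t1
                 + a.y + b.y;                       // plus the low parts
        float hi = t1 + t2;                         // renormalize
        return make_float2(hi, t2 - (hi - t1));
    }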
I’d like something related: a compiler flag that would automatically demote doubles to floats. Variables and even the math functions are easy to handle; it’s the floating point literals in the code (e.g. 2.50) that cause me problems. My code has so many floating point literals that the compiler with -arch=sm_11 ran out of double precision registers. I had to manually change every 2.50 to 2.50f to get it to compile. I’d like a compiler switch that demotes floating point literals from double to float without ever representing them as doubles internally.
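To make the literal problem concrete (a toy example of my own):

    // Unsuffixed literals have type double in C, so this multiply is done
    // in double (or demoted, with a warning, on pre-sm13 targets):
    __device__ float scale_slow(float x) { return x * 2.50; }

    // The f suffix keeps the whole expression in single precision:
    __device__ float scale_fast(float x) { return x * 2.50f; }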
I think there will still be occasions to compile with floats instead of doubles for performance, even with the GT200.
The performance of double precision is expected to range between 80 and 100 Gigaflops peak at production clock rates for the GT200 GPU. We will provide more specific information in the near future.
DP FMA is a fused multiply-add as specified in IEEE 754R (I thought this was pointed out in the Programming Guide?).
The only current DP HW support is for mul, add, fma, and the FP<->INT and FP<->FP conversions. For DP sin, cos, etc., use sin(), cos(), etc.
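For example (a toy device function; the name is just illustrative):

    // fma() compiles to the DP hardware fused multiply-add instruction;
    // sin() is the software implementation from the headers
    // (argument reduction + polynomial approximation).
    __device__ double axpy_sin(double a, double x, double y)
    {
        return sin(fma(a, x, y));
    }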
This is not a bad idea. I can file a bug to request this in a future release.
Do any CPU compilers do this? I don’t see a big need for it. Just as on a CPU, on the GPU you should write float-safe code: use float literals where you only need float precision, and double literals only where you need double precision.
I don’t see why literals should require registers at all. The compiler puts them in constant memory, not registers. If you have an example where it does not, please file a bug against the compiler using your registered CUDA developer account.
Ah, OK, a more careful reading of section A.2 does confirm this. The first sentence is “All compute devices follow the IEEE-754 standard for binary floating-point arithmetic with the following deviations:”, and intermediate truncation of the FMA is mentioned for single precision but not for double precision. I’m so used to thinking of the single precision case as the only case that I mentally promoted the truncation warning up a level in scope. :)
That said, Sections B.1.1 and B.1.2 say that both fmaf() and fma() have 0 ulp of error. Does this mean the IEEE-754R standard allows intermediate truncation in a fused multiply-add operation? (I’m not familiar with what the standard requires here.)