I recently switched from a Quadro FX 4600 to a Quadro 4000. My application is an iterative tomographic reconstruction algorithm that makes heavy use of texture lookups. After the switch, I get slightly different results when using the new GPU. This is not just on the LSB, but up to 1.5% of the final value. I would have expected small differences but not such large ones. My computations are all in single precision float arithmetic.

I’m using CUDA v3.1. I have tried some compiler switches (different -arch and -code options, the " -ftz=true -prec-div=false -prec-sqrt=false" options, different optimization settings). Most leave the results on the Quadro 4000 unchanged or change, at most, the LSB of the result (after saving it as 16-bit unsigned integer).

Therefore my question: Does anyone know anything about changes in the way texture lookups are implemented in the two cards - any changes in coordinate computation, round-off, clamping, …? Any changes to math functions? Any other known changes that could cause the differences in results? Any compiler switches or code changes I could try to make results between the two cards more similar? Unfortunately, it would be very difficult to provide a compilable example to reproduce the problem.

Apologies for the vague problem description, and thanks for any help.
ehan6

Does the application perform computations other than through texture lookups? One change from the Tesla architecture to the Fermi architecture that may come into play here was the replacement of the FMAD instruction with the FFMA instruction.

The FMAD instruction multiplies two single-precision numbers, truncates the result to single-precision, then adds a third single-precision number and rounds the final result to single precision. By contrast, the FFMA instruction implements an IEEE-754 (2008) compliant fused-multiply-add, in which two single-precision numbers are multiplied, and all bits of the result (without truncation or rounding) feed into the addition with the third single-precision number, and the final result is then rounded to single-precision.

The CUDA compiler automatically tries to map computations of the form ab+c into FMAD/FFMA for best performance. The FFMA has many desirable properties compared to the FMAD. Result will generally be slightly different from FMAD and usually slightly more accurate since the entire operation is subject to just a single rounding. Cases in which more than LSB changes to results are observed are typically those where ab and c are of roughly equal magnitude but of different sign, leading to a reduced precision result with FMAD due to subtractive cancellation but a fully accurate result with FFMA.

Since the CUDA math library makes heavy use of FMAD/FFMA slight differences in most non-IEEE-rounded functions occur due to the differences between the two instructions, however the accuracy bounds dcoumented in the Programming Guide are maintained in either case.

Is there any way for me to test whether this change is actually what causes the difference in my results? Can I force use of FMAD (or a suitable emulation) on the Fermi card?

Such cases may occur in my computations, but I assume that it is very difficult to measure or estimate the impact of the FMAD/FFMA change. Any ideas on how to test this?

I do not see an easy way to verify/falsify the hypothesis that the FMAD/FFMA differences are the root cause for the differences you are seeing. This would require precise control as to where FMAD/FFMA should be used and where not, which requires a lot of prep work since both the compiler frontend and backend can apply the contraction FMUL + FADD → FMAD/FFMA. The CUDA math library does this by using explicit intrinsics for FMA, FMUL, and FADD extensively, but user code is not typically written his way. With tight controls in place one could replace FFMAs with some FMAD emulation.

Note that simply turning off FMAD/FFMA merging altogether would not help since that results in a third behavior (rounded multiply followed by rounded add) which matches neither the FMAD nor the FFMA behavior.

If you have a higher precision reference implementation available, I would suggest comparing both Tesla-arch and Fermi-arch versions against that to establish that they are within acceptable error bounds. I use that approach a lot in my own work.

Now I at least isolated the code that makes the difference (or so I believe). For an image filtering operation I compute a truncated sinc kernel sinc(x)sinc(x/2), with sinc(f) = sin(Pif)/(Pi*f). Some (pseudo-)code:

// ...
weight = 1.0f;
x = filterPos - kernelPos * scaleFactor + offset;
x *= PI;
if (f != 0.0f)
{
weight *= __sinf(x)/x;
x /= 2.0f;
weight *= __sinf(x)/x;
}
// ...

This is obviously not the numerically most stable code. I see two issues:

If x slightly changes between compilations / cards, the test (x != 0.0f) may or may not evaluate to true. The next block therefore may or may not be executed and may or may not introduce its round-off errors.

If x is non-zero but very small, small changes in how __sinf is evaluated on different platforms may lead to sizable differences in the value of __sinf(x)/x.

These issues together seem to result in my observed differences in computation results, due to changes in the filter weights. Of course, I will fix them going forward.

It would still be helpful if I could force the Quadro 4000 card to produce similar results to the FX 4600 card with my current code. I tried adding some explicit __fmul_rn and __fadd_rn commands to the computation of x, to fix the order of computation and prohibit fused multiply-add. This, however, already leads to visible differences in results between the original code and the modified code, on the same FX 4600 card. Also, if the real issue is in the internals of __sinf, there’s probably not much I can do.

A related question: without explicit __fmul_rn and __fadd_rn I get mul.f32 and add.f32 commands in the ptx. With the explicit commands, I get mul.rn.f32 and add.rn.f32. Are those executed identically? The programming guide says that round-to-nearest is the default.

The add and multiply operators map to add.32 and mul.f32. No rounding mode is specified, so they use the default rounding mode, which is round-to-nearest-or-even. More importantly for the issue at hand, operations with default rounding mode are eligible for FMAD/FFMA contraction. The device intrinsics __fadd_rn() and __fmul_rn() on the other hand map to add.f32.rn and mul.f32.rn. Their rounding mode is fixed as round-to-nearest-or-even. Operations with explicit rounding mode are not eligible for FMAD/FFMA contraction. This is covered in section C.2.1 of the Programming Guide.

The fact that disabling FMAD/FFMA merging locally by use of __fadd_rn() / __fmul_rn() changes the results noticebly is a pretty clear indication that the code is sensitive to one ulp changes (as noted previous using separately rounded multiplies and adds will generally deliver result that differ from both FMAD and FFMA).

Note that __sinf() maps to a hardware instruction that is designed for absolute, not relative error (see table C.4 in the Programming Guide). This means that for arguments that are small in magnitude the granularity of the results is fairly coarse, and tiny changes in the input could cause fairly large relative differences in the output. sinf() on the other hand is designed for small relative error and would presumably make this computation behave more robustly. Of course its results are also usually different from __sinf() and the function is significantly slower than __sinf().

In general FFMA allows for more accurate computation than FMAD, so if you were to compare the results from both Quadros to a high-precision reference you would likely find that the FFMA-based computation on the Quadro 4000 delivers the more accurate results. As I stated previously there is no straightforward way to achieve bit-for-bit matching between these two hardware platforms.

I will note that due to changes in compiler optimizations and libraries there should be no expectation of bit-for-bit matching of floating-point results between different versions of CUDA (the same caveat applies of course to any other software platform).

[Later:]

I didn’t see your post from 12:48 AM before I posted the above. Your experiment confirms my hypothesis that use of sinf() instead of __sinf() would reduce the differences between the two Quadro platforms.