In typical CUDA programs, the truncation in the intermediate step in FMAD doesn’t significantly effect the accuracy of the output. I have recently run into a couple of cases where it does. I’m posting this in part to help make people aware of this and in part to ask for help from the floating point gurus - as I can’t exactly explain why FMAD biases the results, and cannot come up with a reasonable set of rules to identify when this might occur before actually observing the problem.
Thankfully, Fermi’s FMA instruction operates at full precision and does not exhibit this problem - but established apps still need to support older hardware for at least a few years yet.
Here is a simple case that illustrates the problem: It is a highly simplified form of a molecular dynamics simulation a colleague was setting up. We have N particles moving under newton’s laws. A random and drag force are applied at each step as a thermostat and model of a solvent (brownian dynamics) - and for this test case the particles do not interact. The twist is that these particles are constrained on the surface of a sphere. This is implemented (essentially) by allowing the particle to temporarily move off the sphere. The closest point on the sphere is computed and the particle then forced to that point. It is in this closest point computation where FMAD produces a bias. I’ve attached 2 snapshots showing the results of a simulation performed, one using fmad and one using add + mul. You can see that the mad simulations bias the particles toward two poles, while the add+mul one correctly has a uniform distribution.
Here is the code:
Scalar3 is float3 (maps to double when build for double precision on the CPU)
P is the center of the sphere
U is the point off of the sphere that the particle has moved to
C is the computed closest point to U on the sphere
DEVICE Scalar3 evalClosest(const Scalar3& U)
{
// compute the vector pointing from P to V
Scalar3 V;
V.x = U.x - P.x;
V.y = U.y - P.y;
V.z = U.z - P.z;
// compute 1/magnitude of V
Scalar magVinv = rsqrt(V.x*V.x + V.y*V.y + V.z*V.z);
// compute Vhat, the unit vector pointing in the direction of V
Scalar3 Vhat;
Vhat.x = magVinv * V.x;
Vhat.y = magVinv * V.y;
Vhat.z = magVinv * V.z;
// compute resulting constrained point
Scalar3 C;
#if(USE_FADD_FMUL)
C.x = P.x + __fmul_rn(Vhat.x,r);
C.y = P.y + __fmul_rn(Vhat.y,r);
C.z = P.z + __fmul_rn(Vhat.z,r);
#else
C.x = P.x + Vhat.x * r;
C.y = P.y + Vhat.y * r;
C.z = P.z + Vhat.z * r;
#endif
return C;
}
The problem occurs when P.x and P.y are “large” (45 in this case) and the particles go to the z poles of the sphere. Can anyone think of why fmad might cause this behavior? Better yet, how to identify what cases will have biases introduced by fmad before noticing it in simulations.
So each of Vhat.x * r, Vhat.x * r, Vhat.x * r gets systematic error toward 0. And P.z is the least affected component, because C.xyz error is proportional to the C.xyz (ulp).
Mr. Anderson, I bet you could reduce this report down to an example of three floats A B C where A+_fmul_rn(B, C) shows significant deviation from A+B*C.
Just pick one of the particles and print the values being fed in and the output result in the two cases.
Seeing the actual values for the ABC and the result may give some intuition to what’s happening.
Though a follow up question… these examples are likely not the divergence after running the function once… it’s likely a cumulative effect after many iterations? How many iterations are needed until the bias is visible?
Probably system wants to reach energy minimum, that way cause of (V.xV.x + V.yV.y + V.z*V.z) it is better to increase x,y cause of 100+100+0.00001<0.00001+100+100.
So I suggest you run your computation on a good old NV40: its multiplier has a terrible accuracy, but a much lower bias than G80’s. External Image
About that particular function: is r a constant? Can it be made a power of 2? Or better yet, run all calculations on a sphere of radius 1, and only “inflate” it at the end?
Also, it seems you could factor the multiplication by r, by computing r * magVinv first (may not help about truncation, but saves 2 muls).
Hah! I had heard of that one, but it was always summarized down into a clock overflow problem - not a roundoff error one.
r is a constant, but everything has to fit into the coordinate system of a much larger system of particles - and they set the scale. This is why there is a sphere at 45,45 - there are also lots of of other spheres at other coordinates.
This particular bit of code is only one tiny part of the whole and yet its biasing caused some head scratching when all simulation results showed z-polar orientations defying principles of statistical mechanics.
:"> it does indeed. That would have prevented the compiler from issuing some of the FMADs as well… Honestly, I wasn’t thinking much about saving muls with this code as it is entirely bandwidth bound so FLOPs are free and it only takes up <1% of the total runtime anyways.