Originally published at: https://developer.nvidia.com/blog/lerp-faster-cuda/
Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It’s so common in computer graphics that programmers often use the verb “lerp” to refer to linear interpolation, a function that’s built into all modern graphics hardware (often in multiple hardware units). Linear Interpolation (from Wikipedia) You can enable linear interpolation (also…
Good tip, but this is not something that users should have to do.
Nice post. What makes the FMA implementation higher precision? Is it for values of t close to 1?
I assume by *this* you mean the optimization, not leaping in general.
Users don't *have* to optimize their lerping. They can lerp the traditional way with one extra instruction. It's not usually a huge deal. This tip is for those who really want to squeeze out performance. As I wrote, the compiler won't do the transform automatically because if it did this in all cases, there would be situations where numerical equality wouldn't be preserved. The compiler needs to be conservative.
The optimized function could be provided by a library -- and library authors are one target for this post.
I should also note that on hardware that doesn't support FMA, the approach is likely to be slower. On NVIDIA GPUs it is faster.
Hey Nico! The FMA computes the multiply and add at full precision before rounding, while a separate multiply and add would incur two rounding operations, reducing accuracy. http://en.wikipedia.org/wik...
(Note that I don't think the lerp computation suffers from the unfortunate situation attributed to Kahan mentioned in that wikipedia article.)
Norbert Juffa (who suggested the tip) told me that in tests, the maximum error in the lerp variant with two FMA operations was <= 1ulp, while for the traditional method the error was <= 2 ulp. I didn't write that in the pro tip though because testing wasn't extensive. Establishing error bounds can be tricky. However, we can say that the FMA-based variant is more accurate on average.
Hi Mark! Right! I forgot that the FMA unit works doesn't throw away any bits until after addition. I guess that helps when the arguments of the inner addition are very close to each other (catastrophic cancellation: http://en.wikipedia.org/wik..., in other words when t is close to one, right?
I think C.C. only happens with subtraction, right? That would require either v0 or v1 to be negative, since t usually isn't. In which case sure, that can happen. :) But I guess the FMA approach could be more accurate even ignoring catastrophic cancellation cases.
Mark,
I talked about this exact thing at GTC 2015. You can see it in my slides from the presentation (page 33);
http://on-demand.gputechcon...
And yes Norbert was the one who suggested this to me as well.
What I am saying is that this is an optimization that a compiler should be smart enough to do, especially since a lot of compilers allow some option (e.g., fast-math in gcc https://gcc.gnu.org/wiki/Fl... ) that enables such unsafe optimizations.
Yes, compilers cannot do all the optimizations possible, especially when there are layers and layers of code, but it looks that this is something that can (and should) be supported.
Well, in fma(-t, v0, v0) v0 and v0 are the same sign, and -t is typically negative, so C.C. will certainly happen when t ~= 1, i.e. -t*v0 is *almost* as large as v0, but opposite sign.
Good point! But I thought you were comparing the accuracy of the FMA version to the non-FMA version, which is (1-t)*v0 + t*v1. In that version, the compiler will generate an FMUL, FADD, and FMA. And I don't think C.C will happen in that instruction sequence unless one of the three parameters is negative.
Cool! Norbert knows his stuff.
nvcc has fast math: --use_fast_math. But this optimization still doesn't happen with that enabled, I just checked. Have you checked that this does happen with `gcc -funsafe-math-optimizations`? I wouldn't be surprised if it does not.
Fair enough :) Thanks for clarifying!
'such unsafe optimizations" ? Really? Unsafe in what regard?
The best demonstration of the performance improvement of using the 'unsafe' fast math flag in CUDA is to compile the following with and without that flag:
https://github.com/OlegKoni...
After much testing I found very little difference between values generated using that 'unsafe' flag when compared to equivalent 64 bit MATLAB computation.
Also the accuracy of those CUDA math functions used by that flag is detailed here:
I usually use the form v0+t*(v1-v0) which can be more accurate for small differences - you said add followed by multiply, but fma seems in the reverse order?
Can this form be as fast?
Stuart, the form you suggest is precisely the form that Wikipedia points out is an "Imprecise method which does not guarantee v = v1 when t = 1, due to floating-point arithmetic error." (https://en.wikipedia.org/?t... I would also think that it would be prone to catastrophic cancellation when v1 is orders of magnitude larger than v0.
I checked with Norbert Juffa, who says the accuracy of this alternate form is superior only when v0 and v1 are within a factor of two of each other. That is due to the Sterbenz lemma which states that the subtraction of two positive floating-point numbers a and b, that is, a-b, is exact for 0.5*a <= b <= 2*a. Therefore the maximum error in the final result is simply the error of the FMA, thus 0.5 ulps. However, the moment you allow v0,v1 to be apart by more than a factor of 2, the accuracy of v0+t*(v1-v0) suffers: if they can be within a factor of four of each other, the maximum error increases to 1.5 ulps. If they are within a factor of eight, 2.5 ulps. Etc.
The form with two FMAs from the post has error <= 1 ulp across a very wide range of arguments, so for a general purpose lerp code, we think this is still the most robust form. But clearly there is a trade-off, and if it is known in advance that the interpolants are always within a factor of two of each other (e.g. for interpolating in a known table, rather than sensor data), the proposed alternative form would be the way to go.
Regarding performance: this form would generate an FADD and FMA, so it should be as fast as the 2-FMA version.
(Thanks Norbert!)
If it can't optimize the original eexpression, may be it will be happy with
(v0-t*v0) + t*v1 ?
Great post, thanks. I would understand this should guarantee:
lerp(t, a, b) == a when a==b
lerp(0, a, b) == a
lerp(1, a, b) == b
Is that correct?