GPU Pro Tip: Lerp Faster in C++

jwitsoe · June 11, 2015, 6:14am

Originally published at: https://developer.nvidia.com/blog/lerp-faster-cuda/

Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It’s so common in computer graphics that programmers often use the verb “lerp” to refer to linear interpolation, a function that’s built into all modern graphics hardware (often in multiple hardware units). Linear Interpolation (from Wikipedia) You can enable linear interpolation (also…

anon69886261 · June 11, 2015, 8:38pm

Good tip, but this is not something that users should have to do.

anon54429443 · June 11, 2015, 9:16pm

Nice post. What makes the FMA implementation higher precision? Is it for values of t close to 1?

anon95180265 · June 12, 2015, 2:41am

I assume by *this* you mean the optimization, not leaping in general.

Users don't *have* to optimize their lerping. They can lerp the traditional way with one extra instruction. It's not usually a huge deal. This tip is for those who really want to squeeze out performance. As I wrote, the compiler won't do the transform automatically because if it did this in all cases, there would be situations where numerical equality wouldn't be preserved. The compiler needs to be conservative.

The optimized function could be provided by a library -- and library authors are one target for this post.

I should also note that on hardware that doesn't support FMA, the approach is likely to be slower. On NVIDIA GPUs it is faster.

anon95180265 · June 12, 2015, 2:55am

Hey Nico! The FMA computes the multiply and add at full precision before rounding, while a separate multiply and add would incur two rounding operations, reducing accuracy. http://en.wikipedia.org/wik...
(Note that I don't think the lerp computation suffers from the unfortunate situation attributed to Kahan mentioned in that wikipedia article.)

Norbert Juffa (who suggested the tip) told me that in tests, the maximum error in the lerp variant with two FMA operations was <= 1ulp, while for the traditional method the error was <= 2 ulp. I didn't write that in the pro tip though because testing wasn't extensive. Establishing error bounds can be tricky. However, we can say that the FMA-based variant is more accurate on average.

anon54429443 · June 12, 2015, 6:15am

Hi Mark! Right! I forgot that the FMA unit works doesn't throw away any bits until after addition. I guess that helps when the arguments of the inner addition are very close to each other (catastrophic cancellation: http://en.wikipedia.org/wik..., in other words when t is close to one, right?

anon95180265 · June 12, 2015, 7:51am

I think C.C. only happens with subtraction, right? That would require either v0 or v1 to be negative, since t usually isn't. In which case sure, that can happen. :) But I guess the FMA approach could be more accurate even ignoring catastrophic cancellation cases.

anon5773341 · June 12, 2015, 10:14pm

Mark,

I talked about this exact thing at GTC 2015. You can see it in my slides from the presentation (page 33);

http://on-demand.gputechcon...

And yes Norbert was the one who suggested this to me as well.

anon69886261 · June 13, 2015, 11:16pm

What I am saying is that this is an optimization that a compiler should be smart enough to do, especially since a lot of compilers allow some option (e.g., fast-math in gcc https://gcc.gnu.org/wiki/Fl... ) that enables such unsafe optimizations.

Yes, compilers cannot do all the optimizations possible, especially when there are layers and layers of code, but it looks that this is something that can (and should) be supported.

anon54429443 · June 14, 2015, 6:12am

Well, in fma(-t, v0, v0) v0 and v0 are the same sign, and -t is typically negative, so C.C. will certainly happen when t ~= 1, i.e. -t*v0 is *almost* as large as v0, but opposite sign.

anon95180265 · June 15, 2015, 1:34am

Good point! But I thought you were comparing the accuracy of the FMA version to the non-FMA version, which is (1-t)*v0 + t*v1. In that version, the compiler will generate an FMUL, FADD, and FMA. And I don't think C.C will happen in that instruction sequence unless one of the three parameters is negative.

anon95180265 · June 15, 2015, 1:35am

Cool! Norbert knows his stuff.

anon95180265 · June 15, 2015, 1:49am

nvcc has fast math: --use_fast_math. But this optimization still doesn't happen with that enabled, I just checked. Have you checked that this does happen with `gcc -funsafe-math-optimizations`? I wouldn't be surprised if it does not.

anon54429443 · June 15, 2015, 4:42pm

Fair enough :) Thanks for clarifying!

anon5773341 · June 16, 2015, 6:56pm

'such unsafe optimizations" ? Really? Unsafe in what regard?

The best demonstration of the performance improvement of using the 'unsafe' fast math flag in CUDA is to compile the following with and without that flag:

https://github.com/OlegKoni...

After much testing I found very little difference between values generated using that 'unsafe' flag when compared to equivalent 64 bit MATLAB computation.

Also the accuracy of those CUDA math functions used by that flag is detailed here:

http://docs.nvidia.com/cuda...

anon70042232 · June 20, 2015, 6:48pm

I usually use the form v0+t*(v1-v0) which can be more accurate for small differences - you said add followed by multiply, but fma seems in the reverse order?
Can this form be as fast?

anon95180265 · June 22, 2015, 12:41am

Stuart, the form you suggest is precisely the form that Wikipedia points out is an "Imprecise method which does not guarantee v = v1 when t = 1, due to floating-point arithmetic error." (https://en.wikipedia.org/?t... I would also think that it would be prone to catastrophic cancellation when v1 is orders of magnitude larger than v0.

anon95180265 · June 29, 2015, 1:42am

I checked with Norbert Juffa, who says the accuracy of this alternate form is superior only when v0 and v1 are within a factor of two of each other. That is due to the Sterbenz lemma which states that the subtraction of two positive floating-point numbers a and b, that is, a-b, is exact for 0.5*a <= b <= 2*a. Therefore the maximum error in the final result is simply the error of the FMA, thus 0.5 ulps. However, the moment you allow v0,v1 to be apart by more than a factor of 2, the accuracy of v0+t*(v1-v0) suffers: if they can be within a factor of four of each other, the maximum error increases to 1.5 ulps. If they are within a factor of eight, 2.5 ulps. Etc.

The form with two FMAs from the post has error <= 1 ulp across a very wide range of arguments, so for a general purpose lerp code, we think this is still the most robust form. But clearly there is a trade-off, and if it is known in advance that the interpolants are always within a factor of two of each other (e.g. for interpolating in a known table, rather than sensor data), the proposed alternative form would be the way to go.

Regarding performance: this form would generate an FADD and FMA, so it should be as fast as the 2-FMA version.

(Thanks Norbert!)

anon28810180 · January 19, 2016, 3:13am

If it can't optimize the original eexpression, may be it will be happy with
(v0-t*v0) + t*v1 ?

anon14500171 · October 30, 2017, 11:58pm

Great post, thanks. I would understand this should guarantee:

lerp(t, a, b) == a when a==b
lerp(0, a, b) == a
lerp(1, a, b) == b

Is that correct?

Topic		Replies	Views
Trilinear Interpolation without Textures CUDA Programming and Performance	9	2234	March 20, 2019
Some questions about texture memory CUDA Programming and Performance	8	1647	March 5, 2019
Accuracy of 1D linear interpolation by CUDA texture interpolation CUDA Programming and Performance	25	14628	January 29, 2013
Faster and more accurate implementation of log1pf() CUDA Programming and Performance	15	3299	January 25, 2017
Examining the generated .ptx file CUDA Programming and Performance	13	2397	October 24, 2014
Using tex2D for unsigned short/char CUDA Programming and Performance	14	3688	November 15, 2017
A more accurate, performance-competitive implementation of expf() CUDA Programming and Performance	24	8258	November 19, 2017
GPU Code and CPU Code output not matching till machine precision (i.e. 13 decimals places) CUDA Programming and Performance	22	824	August 9, 2023
Simulate tex2D fetch : what is implementation of linear interpolation? CUDA Programming and Performance	3	1194	August 30, 2021
Mythical Tflops CUDA Programming and Performance	11	1113	January 14, 2019

GPU Pro Tip: Lerp Faster in C++

Related topics