New: cubic interpolation in CUDA (cubic B-spline interpolation)

I have created some code to perform cubic B-spline interpolation in CUDA. This code allows you to replace linear 2D and 3D texture filtering with cubic interpolation.

I have also included prefiltering to convert data samples into B-spline coefficients and several example programs + code.
The CUDA version is 327 times faster than a non-optimized CPU implementation on my PC! :)
Interested? Download the code from: CUDA cubic interpolation


edit: The website address has changed (my previous provider went bankrupt).

Nice! Is it similar to the process in GPU Gems?

Yes it is, except that I do not create a lookup table, but do all calculations on the fly.

That leads to a higher accuracy, while it hardly costs any extra processing time.
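To make "all calculations on the fly" concrete, here is a minimal CPU reference sketch in Python (not the CUDA code from the package). It evaluates the four cubic B-spline weights directly from the fractional sample position instead of reading them from a lookup table. The function names `bspline_weights` and `cubic_interp_1d` and the clamp-to-edge border handling are assumptions made for this illustration.

```python
import math

def bspline_weights(frac):
    """Return the 4 cubic B-spline weights for a fractional offset in [0, 1).

    The weights are evaluated from the polynomial directly (no lookup table).
    """
    one_frac = 1.0 - frac
    w0 = one_frac ** 3 / 6.0
    w1 = 2.0 / 3.0 - 0.5 * frac * frac * (2.0 - frac)
    w2 = 2.0 / 3.0 - 0.5 * one_frac * one_frac * (2.0 - one_frac)
    w3 = frac ** 3 / 6.0
    return w0, w1, w2, w3

def cubic_interp_1d(coeffs, x):
    """Evaluate a cubic B-spline at position x from its coefficients.

    Assumption for this sketch: indices outside the array are clamped
    to the nearest valid index.
    """
    i = math.floor(x)
    weights = bspline_weights(x - i)
    n = len(coeffs)
    total = 0.0
    for k in range(4):
        idx = min(max(i - 1 + k, 0), n - 1)  # taps i-1, i, i+1, i+2
        total += weights[k] * coeffs[idx]
    return total
```

The weights always sum to one, and a linear ramp of coefficients is reproduced exactly away from the borders, which makes a convenient sanity check for any GPU port.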



V. cool!

I briefly attempted to get my head around that algorithm but failed. It’ll be fun to have a look at your code.

You may not have seen it, but the latest CUDA SDK includes a bicubic filtering sample that implements the method in GPU Gems.

This is usually faster since it only requires 4 bilinear texture lookups for a 2D bicubic filter (instead of the usual 16).
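The reduction from 16 lookups to 4 can be illustrated in 1D, where two linear fetches replace four taps; applying the same idea along both axes gives 2 x 2 = 4 bilinear fetches instead of 4 x 4 = 16. The following is a CPU sketch in Python, not the SDK code. `linear_fetch` is a hypothetical stand-in for a hardware linear texture fetch, and the sketch assumes texel centers at integer positions and clamp-to-edge addressing.

```python
import math

def bspline_weights(t):
    """Cubic B-spline weights for fractional offset t in [0, 1)."""
    u = 1.0 - t
    return (u ** 3 / 6.0,
            2.0 / 3.0 - 0.5 * t * t * (2.0 - t),
            2.0 / 3.0 - 0.5 * u * u * (2.0 - u),
            t ** 3 / 6.0)

def fetch(data, idx):
    """Point fetch with clamp-to-edge addressing (assumption of this sketch)."""
    return data[min(max(idx, 0), len(data) - 1)]

def linear_fetch(data, x):
    """Emulate a linearly filtered texture fetch at position x."""
    i = math.floor(x)
    a = x - i
    return (1.0 - a) * fetch(data, i) + a * fetch(data, i + 1)

def cubic_4_taps(coeffs, x):
    """Reference: weighted sum of four point fetches."""
    i = math.floor(x)
    w = bspline_weights(x - i)
    return sum(w[k] * fetch(coeffs, i - 1 + k) for k in range(4))

def cubic_2_linear(coeffs, x):
    """Same result from two linear fetches.

    Each pair of neighbouring taps is merged into one linear fetch whose
    position encodes the ratio of the two weights.
    """
    i = math.floor(x)
    w0, w1, w2, w3 = bspline_weights(x - i)
    g0 = w0 + w1              # combined weight of taps i-1 and i
    g1 = w2 + w3              # combined weight of taps i+1 and i+2
    h0 = (i - 1) + w1 / g0    # fetch position between taps i-1 and i
    h1 = (i + 1) + w3 / g1    # fetch position between taps i+1 and i+2
    return g0 * linear_fetch(coeffs, h0) + g1 * linear_fetch(coeffs, h1)
```

The trick works because all four B-spline weights are non-negative, so each merged fetch position stays between its two taps. On real hardware the filtering fraction has limited precision, so the two variants will not match as closely as they do in this floating-point sketch.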

Oh! Nice, it pays to look at the new features.

I must admit that I also completely missed this example. When I started on my version I used the SDK 2 beta, which did not include the example yet, and when I later upgraded I did not check the examples closely enough.

Since the SDK example also contains a benchmarking mode, I quickly benchmarked both the SDK cubic interpolation and my own code (which was very easy to integrate):

  • regular 2D linear interpolation: 1214 Mpixels/sec

  • SDK 2D cubic interpolation: 1198 Mpixels/sec

  • my 2D cubic interpolation: 1205 Mpixels/sec :)

It is striking that cubic interpolation (both versions) is hardly slower than linear interpolation. I guess that this is due to caching.

Of course my coding effort was not completely in vain, since I also offer 3D cubic interpolation and the Thévenaz prefilter.

Without the prefilter, cubic interpolation has a smoothing effect (you can actually see that when you look closely at the SDK example output).
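The smoothing effect, and how a prefilter removes it, can be shown with a small 1D sketch in Python. It follows the recursive causal/anti-causal formulation published by Thévenaz and Unser for cubic B-splines, with mirror boundary conditions assumed here; the package may handle the borders differently. The names `prefilter_1d` and `spline_at_sample` are made up for this illustration.

```python
import math

POLE = math.sqrt(3.0) - 2.0                # pole of the cubic B-spline prefilter
GAIN = (1.0 - POLE) * (1.0 - 1.0 / POLE)   # overall gain, equals 6

def prefilter_1d(samples):
    """Convert data samples into cubic B-spline coefficients.

    Assumption of this sketch: mirror boundary conditions, at least 2 samples.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    c = [GAIN * s for s in samples]
    # Exact initial value of the causal filter for a mirrored signal.
    total = c[0] + POLE ** (n - 1) * c[n - 1]
    for k in range(1, n - 1):
        total += (POLE ** k + POLE ** (2 * n - 2 - k)) * c[k]
    c[0] = total / (1.0 - POLE ** (2 * n - 2))
    # Causal recursion, left to right.
    for k in range(1, n):
        c[k] += POLE * c[k - 1]
    # Initial value of the anti-causal filter, then right to left.
    c[n - 1] = POLE / (POLE * POLE - 1.0) * (c[n - 1] + POLE * c[n - 2])
    for k in range(n - 2, -1, -1):
        c[k] = POLE * (c[k + 1] - c[k])
    return c

def spline_at_sample(coeffs, i):
    """Value of the cubic B-spline at integer position i (mirrored borders).

    At integer positions the four weights reduce to 1/6, 4/6, 1/6 and 0.
    """
    n = len(coeffs)
    left = coeffs[i - 1] if i > 0 else coeffs[1]
    right = coeffs[i + 1] if i < n - 1 else coeffs[n - 2]
    return (left + 4.0 * coeffs[i] + right) / 6.0
```

Feeding the raw samples to `spline_at_sample` blends each value with its neighbours using the weights 1/6, 4/6, 1/6, which is the smoothing mentioned above. Feeding it the output of `prefilter_1d` instead returns the original sample values, so the spline actually interpolates the data.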



Your numbers prompted me to look at this again, and it turns out there’s a big problem with the sample code - it doesn’t get memory coalescing on the writes (because it writes to a uchar array). Oops.

I fixed this by changing it to a uchar4 array; this will be in the next release. It just goes to show it’s always a good idea to profile your code!

I did the same thing, and these are the numbers I get on my GeForce 9800 GTX:

  • regular 2D linear interpolation: 4560 Mpixels/sec

  • SDK 2D cubic interpolation: 1995 Mpixels/sec

  • my 2D cubic interpolation: 2057 Mpixels/sec

Now there is a more sensible difference between cubic and linear interpolation. I guess that the fact that the performance gap is less than a factor of four is due to texture caching.



Using the -O2 option for NVCC usually triples your CPU performance. Check that out.

It is always good to profile against -O2-optimized CPU code.

Nonetheless, 327X looks rocking good! 327/3 as well… :) good luck!

IIRC there is only one bilinear unit for every two “normal” execution units, which would explain the missing 2x.

Thinking about it straightforwardly, you would need two texture fetch units for every processing unit to explain the missing factor of two (so exactly the opposite); otherwise the processing units are just waiting for the texture fetch to finish.

Anyway, the real situation is a bit more complex: on the GeForce 8800 architecture every texture processor cluster (TPC) contains two streaming multiprocessors (SMs), an L1 texture cache and a texture unit. Every SM has eight streaming processors (SPs). See e.g. this powerpoint.

cheers, Danny

Hey guys… it’s interesting to go through this thread. I am new to CUDA programming. I have a 1D cubic interpolation code. Can that also be sped up by up to 100 times? I haven’t thought about it much as I am still learning CUDA, but I was interested in knowing whether that is actually possible.

Thanks all… :)


I would expect the speedup for 1D interpolation to be smaller, since you would benefit less from the smart data rehashing that the GPU does for 2D and 3D textures. However, I have not tried 1D, so why don’t you give it a try and let us know…

kind regards,


Yes, I certainly will, and I will get back to you guys here as soon as I have something running.

Thanks all,


The cubic interpolation has been extended to efficiently deal with RGBA color data.
An example that performs on-the-fly cubic filtering on AVI playback illustrates this.

Makefiles have also been added for compilation on Mac and Linux.


I have a 3D texture of water velocity data, where many grid points are land (i.e., null values). I would like to use CUDA and the cubic interpolation package to interpolate velocity at specific locations, but ignore the land data points.

Is this generally possible using the built-in interpolation hardware of GPU textures, or with the cubic interpolation package? I can set the land values to whatever I prefer, but I am not sure how to ignore them in the calculations.

Any suggestions would be much appreciated.