New: cubic interpolation in CUDA (cubic B-spline interpolation)

I have created some code to perform cubic B-spline interpolation in CUDA. This code allows you to replace linear 2D and 3D texture filtering with cubic interpolation.

I have also included prefiltering to convert data samples into B-spline coefficients and several example programs + code.
The CUDA version is 327 times faster than a non-optimized CPU implementation on my PC! :)
Interested? Download the code from: CUDA cubic interpolation

regards,
Danny

edit: The website address has changed (my previous provider went bankrupt).

Nice! Is it similar to the process in GPU Gems?

Yes it is, except that I do not create a lookup table, but do all calculations on the fly.

That leads to a higher accuracy, while it hardly costs any extra processing time.
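
As a rough sketch of what computing the weights on the fly can look like: the four cubic B-spline weights are short polynomials in the fractional position, so they can be evaluated per sample instead of being read from a table. The struct and function names below are illustrative and not copied from the package, and the code is plain C++ so it can be checked on the CPU.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type and name, not taken from the posted package.
struct BsplineWeights { float w0, w1, w2, w3; };

// Evaluate the four uniform cubic B-spline weights directly from the
// fractional position t in [0, 1], instead of reading a lookup table.
inline BsplineWeights bsplineWeights(float t)
{
    assert(t >= 0.0f && t <= 1.0f);
    const float s = 1.0f - t;
    BsplineWeights w;
    w.w0 = s * s * s / 6.0f;                         // (1 - t)^3 / 6
    w.w1 = 2.0f / 3.0f - 0.5f * t * t * (2.0f - t);  // (3t^3 - 6t^2 + 4) / 6
    w.w2 = 2.0f / 3.0f - 0.5f * s * s * (2.0f - s);  // same polynomial in (1 - t)
    w.w3 = t * t * t / 6.0f;                         // t^3 / 6
    return w;
}
```

The weights always sum to one, and at t = 0 they reduce to 1/6, 2/3, 1/6 and 0.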

cheers,

Danny

V. cool!

I briefly attempted to get my head around that algorithm but failed. It’ll be fun to have a look at your code.

You may not have seen it, but the latest CUDA SDK includes a bicubic filtering sample that implements the method in GPU Gems.

This is usually faster since it only requires 4 bilinear texture lookups for a 2D bicubic filter (instead of the usual 16).
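
The idea behind the reduced lookup count can be sketched in 1D, where four taps collapse into two linear fetches. In 2D the same merge applies along each axis, giving 4 bilinear fetches instead of 16 taps. The code below is a CPU illustration in plain C++: `linearFetch` is a hypothetical stand-in for a hardware linear texture fetch, and none of the names come from the SDK sample.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Stand-in for a hardware linear texture fetch on a 1D array (clamped
// addressing). Hypothetical helper, used here only to mimic the texture unit.
inline float linearFetch(const std::vector<float>& c, float x)
{
    assert(!c.empty());
    const int last = static_cast<int>(c.size()) - 1;
    const float fi = std::floor(x);
    const float a = x - fi;
    const int i0 = std::max(0, std::min(static_cast<int>(fi), last));
    const int i1 = std::max(0, std::min(static_cast<int>(fi) + 1, last));
    return (1.0f - a) * c[i0] + a * c[i1];
}

// 1D cubic B-spline evaluation using two linear fetches instead of four taps.
inline float cubicTwoFetches(const std::vector<float>& c, float x)
{
    const float i = std::floor(x);
    const float t = x - i;
    const float s = 1.0f - t;
    // Cubic B-spline weights for taps i-1, i, i+1, i+2.
    const float w0 = s * s * s / 6.0f;
    const float w1 = 2.0f / 3.0f - 0.5f * t * t * (2.0f - t);
    const float w2 = 2.0f / 3.0f - 0.5f * s * s * (2.0f - s);
    const float w3 = t * t * t / 6.0f;
    // Merge neighbouring taps: each pair becomes one linear fetch.
    const float g0 = w0 + w1;
    const float g1 = w2 + w3;
    const float p0 = i - 1.0f + w1 / g0;  // lands between taps i-1 and i
    const float p1 = i + 1.0f + w3 / g1;  // lands between taps i+1 and i+2
    return g0 * linearFetch(c, p0) + g1 * linearFetch(c, p1);
}
```

Away from the array boundaries the result matches the direct four-tap sum up to rounding. On real hardware the linear filter uses limited-precision weights, so the two approaches can differ slightly more there.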

Oh! Nice, it pays to look at the new features.

I must admit that I also completely missed this example. When I started on my version I was using the SDK 2 beta, which did not include the example yet, and when I later upgraded I did not check the examples carefully enough.

Since the SDK example also contains a benchmarking mode, I quickly benchmarked both the SDK cubic interpolation and my code (which was very easy to integrate):

  • regular 2D linear interpolation: 1214 Mpixels/sec

  • SDK 2D cubic interpolation: 1198 Mpixels/sec

  • my 2D cubic interpolation: 1205 Mpixels/sec :)

It is striking that cubic interpolation (both versions) is hardly slower than linear interpolation. I guess that this is due to caching.

Of course my coding effort was not completely in vain, since I also offer 3D cubic interpolation and the Thévenaz prefilter.

Without the prefilter, cubic interpolation has a smoothing effect (you can actually see that when you look closely at the SDK example output).
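
To make the smoothing concrete: evaluated exactly on a grid point, the unfiltered cubic B-spline returns (s[k-1] + 4 s[k] + s[k+1]) / 6 rather than s[k]. The prefilter replaces the samples with coefficients for which that blend gives back the original sample. Below is a minimal 1D sketch of the recursive prefilter in plain C++, assuming mirror boundaries. The function name is illustrative, and the package may handle boundaries and precision differently.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// In-place conversion of samples to cubic B-spline coefficients along one
// dimension, assuming mirror (whole-sample symmetric) boundaries.
inline void samplesToCoefficients(std::vector<double>& c)
{
    const int n = static_cast<int>(c.size());
    assert(n >= 2);
    const double z = std::sqrt(3.0) - 2.0;              // pole of the cubic B-spline filter
    const double lambda = (1.0 - z) * (1.0 - 1.0 / z);  // overall gain, equals 6

    for (double& v : c) v *= lambda;

    // Exact causal initialisation for mirror boundaries.
    double zn = z;
    double z2n = std::pow(z, n - 1);
    double sum = c[0] + z2n * c[n - 1];
    z2n *= z2n / z;
    for (int k = 1; k <= n - 2; ++k) {
        sum += (zn + z2n) * c[k];
        zn *= z;
        z2n /= z;
    }
    c[0] = sum / (1.0 - zn * zn);

    // Causal recursion (left to right).
    for (int k = 1; k < n; ++k) c[k] += z * c[k - 1];

    // Anticausal initialisation and recursion (right to left).
    c[n - 1] = (z / (z * z - 1.0)) * (z * c[n - 2] + c[n - 1]);
    for (int k = n - 2; k >= 0; --k) c[k] = z * (c[k + 1] - c[k]);
}
```

For 2D or 3D data the same pass would be applied along each axis in turn.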

cheers,

Danny

Your numbers prompted me to look at this again, and it turns out there’s a big problem with the sample code - it doesn’t get memory coalescing on the writes (because it writes to a uchar array). Oops.

I fixed this by changing it to a uchar4 array; this will be in the next release. It just goes to show it’s always a good idea to profile your code!

I did the same thing, and these are the numbers I get on my GeForce 9800 GTX:

  • regular 2D linear interpolation: 4560 Mpixels/sec

  • SDK 2D cubic interpolation: 1995 Mpixels/sec

  • my 2D cubic interpolation: 2057 Mpixels/sec

Now there is a more sensible difference between cubic and linear interpolation: linear is roughly 2.2 to 2.3 times faster. I guess that the performance gap being less than a factor of four is due to texture caching.

cheers,

Danny

Using the -O2 option with NVCC usually triples your CPU performance. Check that out.

It is always good to profile against -O2 optimized CPU code.

Nonetheless, 327X looks rocking good! 327/3 as well… :) good luck!

IIRC there is only one bilinear unit for every two “normal” execution units, which would explain the missing 2x.

Thinking about it straightforwardly, you would need two texture fetch units for every processing unit to explain the missing factor of two (so exactly the opposite); otherwise the processing units are just waiting for the texture fetch to finish.

Anyway, the real situation is a bit more complex: on the GeForce 8800 architecture, every texture processor cluster (TPC) contains two streaming multiprocessors (SMs), an L1 texture cache, and a texture unit. Every SM has eight streaming processors (SPs), so one texture unit serves 16 SPs. See e.g. this powerpoint.

cheers, Danny

Hey guys… it’s interesting to go through this thread. I am new to CUDA programming. I have a 1D cubic interpolation code. Can that also be sped up by up to 100 times? I haven’t thought about it much as I am still learning CUDA, but I was interested in knowing whether that is actually possible.

Thanks all… :)

Nittin

I would expect the speedup for 1D interpolation to be smaller, since you would benefit less from the smart data reordering (a layout that keeps neighbouring samples close together in memory) that the GPU applies to 2D and 3D textures. However, I have not tried 1D, so why don’t you give it a try and let us know…
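
As a possible starting point for such an experiment, here is a minimal 1D sketch in plain C++: a direct four-tap cubic B-spline evaluation on an ordinary array, with mirror boundaries. The function names are illustrative and not from the package, and the input is assumed to hold B-spline coefficients (prefiltered samples). For a CUDA port the same body could become a device function reading from a raw pointer.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Reflect an out-of-range index back into [0, n-1] (mirror boundary).
inline int mirrorIndex(int i, int n)
{
    if (i < 0) i = -i;
    if (i >= n) i = 2 * (n - 1) - i;
    return std::max(0, std::min(i, n - 1));  // guard for very short arrays
}

// Direct four-tap cubic B-spline evaluation at position x on a plain array
// of B-spline coefficients.
inline float cubicInterp1D(const std::vector<float>& c, float x)
{
    assert(!c.empty());
    const int n = static_cast<int>(c.size());
    const float fi = std::floor(x);
    const int i = static_cast<int>(fi);
    const float t = x - fi;
    const float s = 1.0f - t;
    // Weights for taps i-1, i, i+1, i+2.
    const float w0 = s * s * s / 6.0f;
    const float w1 = 2.0f / 3.0f - 0.5f * t * t * (2.0f - t);
    const float w2 = 2.0f / 3.0f - 0.5f * s * s * (2.0f - s);
    const float w3 = t * t * t / 6.0f;
    return w0 * c[mirrorIndex(i - 1, n)] + w1 * c[mirrorIndex(i, n)]
         + w2 * c[mirrorIndex(i + 1, n)] + w3 * c[mirrorIndex(i + 2, n)];
}
```

A quick sanity check is that a linear ramp of coefficients is reproduced exactly in the interior of the array.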

kind regards,

Danny

Yes, I certainly will, and I will get back to you guys here as soon as I have something running.

Thanks all,

Nittn

The cubic interpolation has been extended to efficiently deal with RGBA color data.
An example that performs on-the-fly cubic filtering on AVI playback illustrates this.

Makefiles have also been added for compilation on Mac and Linux.
See here.

cheers!
Danny

I have a 3D texture of water velocity data, where many grid points are land (i.e., null values). I would like to use CUDA and the cubic interpolation package to interpolate velocity at specific locations, but ignore the land data points.

Is this generally possible using the built-in interpolation hardware of GPU textures, or with the cubic interpolation package? I can set the land values to whatever I prefer, but I am not sure how to ignore them in the calculations.

Any suggestions would be much appreciated.
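
One common workaround, which is not from this thread and is offered only as a hedged sketch, is normalized interpolation: weight each neighbour by a validity mask and divide by the sum of the weights that remain. The plain C++ example below does this for 1D linear interpolation, and the function name and mask layout are assumptions. With hardware filtering the same idea could in principle be applied by filtering a texture of value × mask and a texture of mask, then dividing, but that variant is untested here. It also does not carry over directly to prefiltered B-spline coefficients, because the prefilter mixes land values into neighbouring coefficients.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Linear interpolation that ignores samples flagged as invalid (e.g. land).
// Returns false when no valid neighbour contributes at position x.
inline bool maskedLerp(const std::vector<float>& value,
                       const std::vector<unsigned char>& valid,
                       float x, float& out)
{
    assert(value.size() == valid.size() && value.size() >= 2);
    const int n = static_cast<int>(value.size());
    const int i = std::max(0, std::min(static_cast<int>(std::floor(x)), n - 2));
    const float a = std::min(std::max(x - static_cast<float>(i), 0.0f), 1.0f);
    // Zero the weight of every invalid neighbour.
    const float w0 = (1.0f - a) * (valid[i] ? 1.0f : 0.0f);
    const float w1 = a * (valid[i + 1] ? 1.0f : 0.0f);
    const float wsum = w0 + w1;
    if (wsum <= 0.0f) return false;  // only invalid data nearby
    // Renormalise so the remaining weights sum to one.
    out = (w0 * value[i] + w1 * value[i + 1]) / wsum;
    return true;
}
```

With this scheme the value stored at a land point never influences the result, so it can be set to anything.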

Hello, I am new to CUDA and I found this thread useful since it covers cubic interpolation. Does anyone know how to test the code? I am not too familiar with textures. What types of arguments does a texture structure take? Any help would be really appreciated. Thank you so much…

There are CUDA sample codes that demonstrate texture usage. The programming guide has various topics concerning textures. There are also blogs that discuss various aspects of texture usage.

On a CUDA GPU the texture unit is a hardware unit that is principally a spatially-optimized cache. It requires explicit programming, both in host code and device code, to make use of it. In addition, for certain use cases, the texture unit can also do certain forms of interpolation. Depending on the use case, people may use the texture unit for one or both purposes: as a cache, or as a cache+interpolator. I don’t happen to know which is being referred to in this 13-15 year old thread or what exactly is in use in the linked repository.

I would caution against using textures. It is not mentioned in the programming guide’s table of instruction throughputs, but in my (limited) experience using the texture units leads to poor performance.

These slides show the reason: The texture unit can only generate 4 outputs per cycle!!

Unless all 32 threads in your warp access the same address/position in your texture, just implement linear/cubic interpolation using normal floating point operations and linear arrays.

Note: For spatially optimizing the cache, simply use 32x4 rectangular thread blocks or similar to localize your 2D accesses.
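
A minimal sketch of the arithmetic route, assuming a row-major float image in a linear array with samples located at integer coordinates (no half-texel offset as with textures). It is written as plain C++ so it can be checked on the CPU. In a kernel the same body could be a device function taking a raw pointer, and the function name is illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Bilinear interpolation on a row-major linear array using ordinary
// floating point operations; coordinates outside the image are clamped.
inline float bilinear(const std::vector<float>& img, int width, int height,
                      float x, float y)
{
    assert(width > 0 && height > 0);
    assert(img.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    const float fx = std::floor(x);
    const float fy = std::floor(y);
    const float ax = x - fx;  // horizontal blend factor
    const float ay = y - fy;  // vertical blend factor
    const int x0 = std::max(0, std::min(static_cast<int>(fx), width - 1));
    const int x1 = std::max(0, std::min(static_cast<int>(fx) + 1, width - 1));
    const int y0 = std::max(0, std::min(static_cast<int>(fy), height - 1));
    const int y1 = std::max(0, std::min(static_cast<int>(fy) + 1, height - 1));
    // Blend along x on the two rows, then blend the rows along y.
    const float top = (1.0f - ax) * img[y0 * width + x0] + ax * img[y0 * width + x1];
    const float bottom = (1.0f - ay == 0.0f ? 0.0f : 1.0f) * 0.0f
                       + (1.0f - ax) * img[y1 * width + x0] + ax * img[y1 * width + x1];
    return (1.0f - ay) * top + ay * bottom;
}
```

Whether this beats the texture path will depend on the access pattern and the GPU, so it is worth benchmarking both.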