Using shared memory for bilinear filter

I wrote a CUDA program which takes an arbitrarily sized RGBA image and copies it into a smaller image. While it does this it must do a sort of “bilinear filter” to average the pixels. The operation is quite simple:

Determine the scale/ratio of the src vs. dst image (i.e., 8x8 to 4x4 is scale 2).
For each source pixel, divide by scale^2, and add to the destination pixel. So in the above example we’d be taking 4 pixels, scaling each one by 0.25, and adding them together to produce the result pixel.

I’m running one thread per source pixel. The problem of course is that this is completely not thread safe: many threads are reading from and writing to the same destination pixel. It’s also going to be terrible for memory coalescing.

I think what I need to do is allocate some shared memory and break this up into two passes. I wanted to get some feedback on what a good approach would be. I’m almost thinking I should rewrite it so that I run one thread per destination pixel, and do a gather operation instead. This won’t be well coalesced either, though.

Run one thread per destination pixel. It would then read whatever source pixels it needs through a 2D texture read, and coalescing the write is easy.
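To make the one-thread-per-destination-pixel idea concrete, here is a minimal sketch. It assumes an integer scale and source dimensions that are exact multiples of it, and it reads from plain global memory rather than a 2D texture to keep the example short. The names downscaleGather and downscaleOnGpu are made up for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per destination pixel: each thread averages its own
// scale x scale block of source pixels, so no two threads ever write
// to the same output location.
__global__ void downscaleGather(const uchar4* src, uchar4* dst,
                                int dstW, int dstH, int scale)
{
    int dx = blockIdx.x * blockDim.x + threadIdx.x;
    int dy = blockIdx.y * blockDim.y + threadIdx.y;
    if (dx >= dstW || dy >= dstH) return;

    int srcW = dstW * scale;  // assumption: source width is an exact multiple
    unsigned int r = 0, g = 0, b = 0, a = 0;  // integer accumulators
    for (int sy = 0; sy < scale; ++sy) {
        const uchar4* row = src + (size_t)(dy * scale + sy) * srcW + dx * scale;
        for (int sx = 0; sx < scale; ++sx) {
            uchar4 p = row[sx];
            r += p.x; g += p.y; b += p.z; a += p.w;
        }
    }
    unsigned int n = scale * scale;
    unsigned int h = n / 2;  // adding n/2 rounds to nearest instead of truncating
    dst[dy * dstW + dx] = make_uchar4((unsigned char)((r + h) / n),
                                      (unsigned char)((g + h) / n),
                                      (unsigned char)((b + h) / n),
                                      (unsigned char)((a + h) / n));
}

// Hypothetical host wrapper: uploads the image, runs the kernel, downloads the result.
std::vector<uchar4> downscaleOnGpu(const std::vector<uchar4>& src,
                                   int srcW, int srcH, int scale)
{
    int dstW = srcW / scale, dstH = srcH / scale;
    std::vector<uchar4> dst((size_t)dstW * dstH);
    uchar4 *dSrc = nullptr, *dDst = nullptr;
    cudaMalloc(&dSrc, src.size() * sizeof(uchar4));
    cudaMalloc(&dDst, dst.size() * sizeof(uchar4));
    cudaMemcpy(dSrc, src.data(), src.size() * sizeof(uchar4), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((dstW + block.x - 1) / block.x, (dstH + block.y - 1) / block.y);
    downscaleGather<<<grid, block>>>(dSrc, dDst, dstW, dstH, scale);

    cudaMemcpy(dst.data(), dDst, dst.size() * sizeof(uchar4), cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    return dst;
}
```

The writes are coalesced because neighbouring threads write neighbouring destination pixels. The reads are strided by scale pixels between neighbouring threads, which is the part a texture fetch would be expected to help with.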

Check out the convolutionSeparable example in the SDK.

Also, have you checked out using textures? You get bilinear interpolation ‘for free’ when you’re reading from textures. There’s an example for that, too.

This is the first thing I looked into, but oddly enough it seems that CUDA does not support any fancy texturing modes. Straight from the docs:

These functions fetch the region of linear memory bound to texture reference texRef using texture coordinate x. No texture filtering and addressing modes are supported. For integer types, these functions may optionally promote the integer to 32-bit floating point.

Also, even if it supported bilinear filtering, copying into a texture that’s less than 50% of the size of the original would no longer filter correctly, since the bilinear blend only considers the 4 neighbouring texels.

I’ll look into the convolutionSeparable sample.

You have a point there.

But bilinear filtering among the 4 neighboring elements in the array CAN be done in CUDA. You quoted a portion of the manual that was referring only to 1D textures bound to device memory. Check again under the section for 2D (or 1D) textures bound to “array” memory (i.e., cudaBindTextureToArray). There aren’t as many texturing modes as OpenGL has, but you can do bilinear filtering, clamp or wrap coordinates, or use normalized coordinates.
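As a rough sketch of how array-backed bilinear filtering could be used for the downscale, the code below samples at the corner shared by each 2x2 block of texels, so the hardware weights all four by 0.25. For larger even scales it takes (scale/2)^2 such taps per destination pixel and averages them, which covers every source texel once. Note the assumptions: it uses the newer texture object API (cudaCreateTextureObject) rather than cudaBindTextureToArray, it only handles even integer scales, and the function names are hypothetical. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Wraps a cudaArray of uchar4 texels in a texture with hardware bilinear filtering.
cudaTextureObject_t makeBilinearTexture(cudaArray_t array)
{
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = array;

    cudaTextureDesc tex = {};
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModeLinear;       // bilinear for 2D textures
    tex.readMode = cudaReadModeNormalizedFloat;  // 8-bit channels come back as floats in [0, 1]
    tex.normalizedCoords = 0;                    // address texels in pixel units

    cudaTextureObject_t obj = 0;
    cudaCreateTextureObject(&obj, &res, &tex, nullptr);
    return obj;
}

// One thread per destination pixel; scale is assumed to be an even integer.
__global__ void downscaleWithTexture(cudaTextureObject_t srcTex, uchar4* dst,
                                     int dstW, int dstH, int scale)
{
    int dx = blockIdx.x * blockDim.x + threadIdx.x;
    int dy = blockIdx.y * blockDim.y + threadIdx.y;
    if (dx >= dstW || dy >= dstH) return;

    int taps = scale / 2;  // each tap blends one 2x2 block of texels
    float4 sum = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    for (int ty = 0; ty < taps; ++ty) {
        for (int tx = 0; tx < taps; ++tx) {
            // Texel (i, j) is centred at (i + 0.5, j + 0.5), so the corner shared
            // by texels 2k and 2k + 1 lies at 2k + 1.
            float u = dx * scale + 2 * tx + 1.0f;
            float v = dy * scale + 2 * ty + 1.0f;
            float4 c = tex2D<float4>(srcTex, u, v);
            sum.x += c.x; sum.y += c.y; sum.z += c.z; sum.w += c.w;
        }
    }
    float k = 255.0f / (taps * taps);  // average the taps and convert back to 0..255
    dst[dy * dstW + dx] = make_uchar4((unsigned char)(sum.x * k + 0.5f),
                                      (unsigned char)(sum.y * k + 0.5f),
                                      (unsigned char)(sum.z * k + 0.5f),
                                      (unsigned char)(sum.w * k + 0.5f));
}

// Hypothetical host wrapper: copies the image into a cudaArray and runs the kernel.
std::vector<uchar4> downscaleViaTexture(const std::vector<uchar4>& src,
                                        int srcW, int srcH, int scale)
{
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<uchar4>();
    cudaArray_t array = nullptr;
    cudaMallocArray(&array, &fmt, srcW, srcH);
    cudaMemcpy2DToArray(array, 0, 0, src.data(), srcW * sizeof(uchar4),
                        srcW * sizeof(uchar4), srcH, cudaMemcpyHostToDevice);
    cudaTextureObject_t tex = makeBilinearTexture(array);

    int dstW = srcW / scale, dstH = srcH / scale;
    std::vector<uchar4> dst((size_t)dstW * dstH);
    uchar4* dDst = nullptr;
    cudaMalloc(&dDst, dst.size() * sizeof(uchar4));

    dim3 block(16, 16);
    dim3 grid((dstW + block.x - 1) / block.x, (dstH + block.y - 1) / block.y);
    downscaleWithTexture<<<grid, block>>>(tex, dDst, dstW, dstH, scale);

    cudaMemcpy(dst.data(), dDst, dst.size() * sizeof(uchar4), cudaMemcpyDeviceToHost);
    cudaDestroyTextureObject(tex);
    cudaFreeArray(array);
    cudaFree(dDst);
    return dst;
}
```

The texture unit interpolates at reduced precision, so results may differ from an exact integer average by a small amount.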

I take that back. The docs are a bit unclear, but it seems some basic linear and bilinear filtering is supported.

Linear texture filtering may be done only for textures that are configured to return floating-point data. It performs low-precision interpolation between neighboring texels. When enabled, the texels surrounding a texture fetch location are read and the return value of the texture fetch is interpolated based on where the texture coordinates fell between the texels. Simple linear interpolation is performed for one-dimensional textures and bilinear interpolation is performed for two-dimensional textures.

The problem here is that a) I’d have to convert back to integer, and b) it only blends the nearest neighbouring texels, so copying to a very small texture would lead to incorrect results.