General Convolution


I saw in the SDK that there is sample code for separable convolution and for FFT convolution, which is efficient for large kernel sizes, but is there any library code for a general (non-separable) convolution that is efficient for small kernel sizes?


You can use 2D texture fetching to do it.

It will be faster than using shared memory because you don't have to deal with the overlap problem.

There are already some examples in the forum.

Can you point me to where that is? I can’t seem to find it with the search.

Anyone want to chime in on using textures for general 2d convolution? Are there examples of this? Is this really optimal?

It should be quite OK in my opinion, from what little I've seen.

  • There's an example in the SDK, convolutionTexture I think it was? There are two or three convolution sample projects in NVIDIA_CUDA_SDK/projects.

At the very least it will give you some best-practice rules of thumb.

I've posted an implementation here: …;hl=convolution

It looks like the filter is in constant memory, which I think should work well if every thread convolves with the same filter value at every step of the loop; that should be the case here, so it will use the broadcast mechanism.
Not sure how shared memory would do; I assume it would do fine.


I've been working for quite some time now on an implementation of general convolution based on the separable convolution example (which can be found in the SDK).

Basically, what I did was use the exact same indices (apron_start, apron_end, and so on), but for both the x and y dimensions (in the SDK example, they are used for only one dimension at a time).

It works fine as long as I don't need to use a for loop to convolve the pixel (smemPosx, smemPosy) with the kernel. If I write, for example, something like:

float sum = 4.0f * data[smemPosy][smemPosx]
          - data[smemPosy - 1][smemPosx]
          - data[smemPosy + 1][smemPosx]
          - data[smemPosy][smemPosx - 1]
          - data[smemPosy][smemPosx + 1];
which corresponds to a convolution with the filter

[ 0 -1  0;
 -1  4 -1;
  0 -1  0 ];

this way of coding the convolution works fine; I got the expected result.

But with the following code, where KERNEL is the float array

[0 -1 0;

-1 4 -1;

0 -1 0]

// Check that we can write....
const int writePosx = tilestart_x + threadIdx.x;
const int writePosy = tilestart_y + threadIdx.y;
const int smemPosx = writePosx - apronstart_x;
const int smemPosy = writePosy - apronstart_y;

float sum = 0.0f;
float w_t = 0.0f;

for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
  for (int l = -KERNEL_RADIUS; l <= KERNEL_RADIUS; l++){
    sum += data[smemPosy + k][smemPosx + l] * KERNEL[k][l];
    w_t += (float) 1;
  }
}




This code works fine as long as the TILE length is equal to 1, but as soon as I increase the length of the tile (a larger tile lets the algorithm run faster), the shape of the tile (a rectangle of dimensions TILE_X by TILE_Y) becomes easily noticeable in the output image... but it shouldn't be, and I don't understand why the program generates such images.

If someone has ever tried and succeeded in creating a 2D convolution based on the one in the SDK, I'll be happy to hear about it!

Thx in advance!


Alright, so I found where my mistake is... as usual, it is an index error. >.< I'm just surprised that my kernel didn't crash earlier, because I thought it would trigger a segmentation fault (the indices into the array KERNEL were < 0).

So in my for loop, I should have written:

for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++){
  for (int l = -KERNEL_RADIUS; l <= KERNEL_RADIUS; l++){
    sum += data[smemPosy + k][smemPosx + l] * KERNEL[k + KERNEL_RADIUS][l + KERNEL_RADIUS];
    w_t += (float) 1;
  }
}



So the lesson that should be retained from this time-costing bug is: always check your indices thoroughly and verify them by hand!!! (Use your brain and a pencil, and unroll the thread and block indices yourself!)