I saw in the SDK that there is sample code for separable convolution, and for FFT convolution that is efficient for big kernel sizes, but is there any library code for a general (non-separable) convolution that is efficient for small kernel sizes?
It looks like the filter is in constant memory, which I think should work well if every thread reads the same filter value at every step of the loop. That should be the case here, so the reads will use the broadcast mechanism.
I'm not sure how shared memory would compare; I assume it would do fine.
I've been working for quite some time now on an implementation of general convolution based on the separable convolution sample (which can be found in the SDK).
Basically, I use the exact same indexes (apron_start, apron_end, and so on), but for both the x and y dimensions (the SDK example handles only one dimension at a time).
It works fine as long as I don't need to use a FOR loop to convolve the pixel (smemPosx, smemPosy) with the kernel. If I write for example something like:
which corresponds to a convolution with the filter
[ 0 -1  0;
 -1  4 -1;
  0 -1  0];
This way of coding the convolution works fine: I get the expected result.
But with the following code, where KERNEL is the float array
[0 -1 0;
-1 4 -1;
0 -1 0]
// Check that we can write....
const int writePosx = tilestart_x + threadIdx.x;
const int writePosy = tilestart_y + threadIdx.y;
const int smemPosx = writePosx - apronstart_x;
const int smemPosy = writePosy - apronstart_y;
float sum = 0.0f;
float w_t = 0.0f;
for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++) {
    for (int l = -KERNEL_RADIUS; l <= KERNEL_RADIUS; l++) {
        sum += sdata[smemPosx + k][smemPosy + l] * KERNEL[smemPosx - k][smemPosy + l];
        w_t += 1.0f;
    }
}
result = sum / w_t;
This code works fine as long as the TILE length is equal to 1. As soon as I change the length of the tile (a larger tile lets the algorithm run faster), the shape of the TILE (a rectangle of dimensions TILE_X by TILE_Y) becomes clearly visible in the output. It shouldn't be, and I don't understand why the program generates such images.
If someone has ever tried and succeeded in creating a 2D convolution based on the one in the SDK, I'll be happy to hear about it!
Alright, so I found where my mistake is… as usual it is an index error. >.< I'm just surprised that my kernel didn't crash earlier, because I thought it would provoke a segmentation fault (the indexes for the array KERNEL were <0).
So in my for loop, I should have written
for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; k++) {
    for (int l = -KERNEL_RADIUS; l <= KERNEL_RADIUS; l++) {
        sum += sdata[smemPosx + k][smemPosy + l] * KERNEL[KERNEL_RADIUS - k][KERNEL_RADIUS - l];
        w_t += 1.0f;
    }
}
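For anyone who wants to check the GPU output of a loop like this, here is a minimal host-side sketch in plain C++ (it should also compile inside a .cu file). It is not from the SDK: the names convolveReference and pixelOrZero are made up for illustration, and treating pixels outside the image as zero is an assumption, so the borders may differ from whatever the apron loading does on the GPU. The inner loop uses the same index convention and the same division by the tap count as the corrected loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative constants; the real values come from your own kernel setup.
constexpr int KERNEL_RADIUS = 1;
constexpr int KERNEL_W = 2 * KERNEL_RADIUS + 1;

// Reads pixel (x, y) from a row-major image. Pixels outside the image are
// treated as zero (an assumption, the GPU apron may be filled differently).
static float pixelOrZero(const std::vector<float>& in, int width, int height,
                         int x, int y) {
    if (x < 0 || x >= width || y < 0 || y >= height) return 0.0f;
    return in[static_cast<std::size_t>(y) * width + x];
}

// CPU reference using the same indexing as the corrected loop:
// image offset (+k, +l) is paired with KERNEL[KERNEL_RADIUS - k][KERNEL_RADIUS - l],
// and the sum is divided by the number of taps, like sum / w_t.
std::vector<float> convolveReference(const std::vector<float>& in,
                                     int width, int height,
                                     const float kernel[KERNEL_W][KERNEL_W]) {
    assert(in.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> out(in.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
                for (int l = -KERNEL_RADIUS; l <= KERNEL_RADIUS; ++l) {
                    sum += pixelOrZero(in, width, height, x + k, y + l) *
                           kernel[KERNEL_RADIUS - k][KERNEL_RADIUS - l];
                }
            }
            out[static_cast<std::size_t>(y) * width + x] =
                sum / static_cast<float>(KERNEL_W * KERNEL_W);
        }
    }
    return out;
}
```

Feeding it an image with a single nonzero pixel and a kernel whose entries are all different is a quick way to see whether the kernel indexes are flipped or shifted, something a symmetric filter like the one above will hide.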
So the lesson to retain from that time-consuming bug is: always check the indexes thoroughly and verify them by hand! (Use your brain and a pencil, and unroll the thread and block indexes yourself!)
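The pencil-and-paper check can also be done with a few lines of host code. The sketch below is only an illustration: the function names are made up, it models a single dimension, and it assumes apronstart = tilestart - KERNEL_RADIUS as in the SDK sample, so that smemPos = threadIdx + KERNEL_RADIUS. Under that assumption the buggy index smemPosx - k runs past the upper end of KERNEL rather than going negative, but either way it is out of bounds for most threads.

```cpp
// Illustrative constants; use the values from your own kernel.
constexpr int KERNEL_RADIUS = 1;
constexpr int KERNEL_W = 2 * KERNEL_RADIUS + 1;

// True if i is a valid index into one dimension of KERNEL.
bool inKernelRange(int i) { return i >= 0 && i < KERNEL_W; }

// Counts (thread, k) pairs along one dimension for which the buggy
// index smemPos - k falls outside KERNEL.
// Assumes smemPos = threadIdx + KERNEL_RADIUS (apron directly before the tile).
int countBadBuggy(int tileLen) {
    int bad = 0;
    for (int t = 0; t < tileLen; ++t) {
        const int smemPos = t + KERNEL_RADIUS;
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
            if (!inKernelRange(smemPos - k)) ++bad;
        }
    }
    return bad;
}

// Same count for the corrected index KERNEL_RADIUS - k, which does not
// depend on the thread position at all.
int countBadFixed(int tileLen) {
    int bad = 0;
    for (int t = 0; t < tileLen; ++t) {
        for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
            if (!inKernelRange(KERNEL_RADIUS - k)) ++bad;
        }
    }
    return bad;
}
```

With these assumptions, countBadBuggy(1) is 0 while any larger tile length gives a nonzero count, which matches the symptom that only a TILE length of 1 produced a clean image. countBadFixed stays at 0 for every tile length.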