How to extend the convolution separable example to large filter radii?

Hi,

For my current project I need to downscale a 2D array of magnitude values. To do this I apply an anti-aliasing filter before picking out the values I need.

I derived the filter kernel from the convolution separable example. Now I'm hitting the limits of my Tesla C870 when I need a very large filter.

For example, if I have 131072 data points in the row direction and want to downscale them to 1024 for display purposes, I need a filter radius of

3 * (131072/1024) = 384 (the factor 3 is for blurring)

Now if I use the convolution separable example, I have to create a block dimension of (2*radius + tile width), which in this case is 2*384 + 128 = 896.

Since 896 > 512 (the maximum block size), the execution fails. As I'm also hitting memory limits in some of these cases, this is no big problem at the moment. But looking ahead to the Tesla C1070, which will expand what I can do, this limitation would become a real bottleneck.
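
To make the arithmetic concrete, here is a minimal host-side sketch that recomputes the radius and the block size and compares the result with the device limit. The helper names (filterRadius, requiredBlockDim, fitsInOneBlock) are made up for illustration, and the tile width of 128 is simply what the figure 896 implies (896 - 2*384).

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: radius = blur factor * downscale ratio.
int filterRadius(int inputWidth, int outputWidth, int blurFactor)
{
    return blurFactor * (inputWidth / outputWidth);
}

// Block size needed when every shared-memory element (tile plus the
// apron of `radius` samples on each side) gets its own loading thread.
int requiredBlockDim(int radius, int tileWidth)
{
    return 2 * radius + tileWidth;
}

// Returns true if blockDim fits the per-block thread limit of device 0.
// Also returns false if the device properties cannot be queried.
bool fitsInOneBlock(int blockDim)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;
    return blockDim <= prop.maxThreadsPerBlock;
}
```

With the numbers from above, filterRadius(131072, 1024, 3) gives 384 and requiredBlockDim(384, 128) gives 896, which is over the 512-thread limit I run into on the C870.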

So, does anyone who has worked with the same example know an easy way around this limit?

One idea would be a two-step downscaling, which is not so great for performance and memory reasons. And even that would hit a limit somewhere (though it might be far beyond anything useful for now).

Another option might be to let the threads that calculate the output also load the data needed outside the tile from global to shared memory (the threads that currently only do this loading are idle during the computation anyway).
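
A rough sketch of how this second option could look for the row pass, not a tested replacement for the example's kernel. The block has only TILE_W threads, and each thread loads several elements of the tile plus apron in a strided loop before computing one output. All names here (rowFilterLoopedLoad, filterRows, d_taps, RADIUS, TILE_W) are placeholders of mine, zero padding at the row borders is an assumption, and error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define RADIUS 384   // 3 * (131072 / 1024)
#define TILE_W 128   // outputs per block, and also the block size

__constant__ float d_taps[2 * RADIUS + 1];   // filter coefficients

// One block filters TILE_W consecutive samples of one row.
// blockDim.x is TILE_W, not TILE_W + 2*RADIUS: the threads walk over the
// tile-plus-apron region in steps of blockDim.x to fill shared memory.
__global__ void rowFilterLoopedLoad(const float* in, float* out, int width)
{
    __shared__ float s[TILE_W + 2 * RADIUS];

    const float* rowIn  = in  + (size_t)blockIdx.y * width;
    float*       rowOut = out + (size_t)blockIdx.y * width;
    int tileStart = blockIdx.x * TILE_W;

    for (int i = threadIdx.x; i < TILE_W + 2 * RADIUS; i += blockDim.x) {
        int g = tileStart - RADIUS + i;                   // global column
        s[i] = (g >= 0 && g < width) ? rowIn[g] : 0.0f;   // zero padding (assumption)
    }
    __syncthreads();

    int x = tileStart + threadIdx.x;
    if (x < width) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += s[threadIdx.x + RADIUS + k] * d_taps[RADIUS - k];
        rowOut[x] = sum;
    }
}

// Hypothetical host wrapper: filters every row of a width x height array.
// taps must hold 2*RADIUS + 1 coefficients.
std::vector<float> filterRows(const std::vector<float>& in, int width,
                              int height, const std::vector<float>& taps)
{
    std::vector<float> out(in.size());
    size_t bytes = in.size() * sizeof(float);
    float *dIn = 0, *dOut = 0;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(d_taps, taps.data(), (2 * RADIUS + 1) * sizeof(float));

    dim3 grid((width + TILE_W - 1) / TILE_W, height);
    rowFilterLoopedLoad<<<grid, TILE_W>>>(dIn, dOut, width);

    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

This way the block size stays at 128 threads no matter how large the radius gets. What grows instead is the shared memory per block, here (128 + 2*384) * 4 bytes = 3584 bytes, and the coefficient array in constant memory (769 floats), so those become the limits to watch.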

Any other suggestions?

Thanks,

Vrah