Summed area table, blur filtering

Hi,

I’m doing a quick implementation of a summed area table (SAT) in CUDA, simply to compare its performance against a shader-only implementation (for my bachelor thesis).

As I expected, the row construction is very fast with CUDA, but building a blurred image from the SAT is very slow. This comes mainly from its random memory access pattern. When the SAT is combined with a depth map to create a depth-of-field effect, the blur radius differs from pixel to pixel, so each pixel's lookups behave like random reads from global memory.
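To make the access pattern concrete, here is a minimal CPU-side sketch in plain C++. It is not the code from the repository linked at the end of this post, and every name in it (`buildSAT`, `satAt`, `boxBlur`, `blurWithRadiusMap`) is made up for illustration. It assumes a single-channel image in row-major order and an integer blur radius per pixel that has already been derived from the depth map.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive summed area table: sat[y*w + x] = sum of src over [0..x] x [0..y].
std::vector<double> buildSAT(const std::vector<float>& src, int w, int h) {
    assert(src.size() == static_cast<std::size_t>(w) * h);
    std::vector<double> sat(src.size());
    for (int y = 0; y < h; ++y) {
        double rowSum = 0.0;  // running sum of the current row
        for (int x = 0; x < w; ++x) {
            rowSum += src[y * w + x];
            sat[y * w + x] = rowSum + (y > 0 ? sat[(y - 1) * w + x] : 0.0);
        }
    }
    return sat;
}

// Read the SAT, treating everything left of or above the image as zero.
double satAt(const std::vector<double>& sat, int w, int x, int y) {
    return (x < 0 || y < 0) ? 0.0 : sat[y * w + x];
}

// Box-filtered value at (x, y) with radius r, window clamped to the image.
// Four SAT reads per pixel; their addresses depend on r.
float boxBlur(const std::vector<double>& sat, int w, int h,
              int x, int y, int r) {
    int x0 = std::max(x - r, 0), y0 = std::max(y - r, 0);
    int x1 = std::min(x + r, w - 1), y1 = std::min(y + r, h - 1);
    double sum = satAt(sat, w, x1, y1) - satAt(sat, w, x0 - 1, y1)
               - satAt(sat, w, x1, y0 - 1) + satAt(sat, w, x0 - 1, y0 - 1);
    int area = (x1 - x0 + 1) * (y1 - y0 + 1);
    return static_cast<float>(sum / area);
}

// Depth-of-field style pass: radius[i] is the (hypothetical) per-pixel
// blur radius, so neighbouring pixels may read far-apart SAT entries.
std::vector<float> blurWithRadiusMap(const std::vector<float>& src,
                                     const std::vector<int>& radius,
                                     int w, int h) {
    std::vector<double> sat = buildSAT(src, w, h);
    std::vector<float> out(src.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[y * w + x] = boxBlur(sat, w, h, x, y, radius[y * w + x]);
    return out;
}
```

With a constant radius, neighbouring pixels read neighbouring SAT entries. Once the radius comes from the depth map, the four corner addresses can jump around between adjacent pixels, which is the pattern I mean above.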

I’ve looked at the CUDPP implementation, but it simply uses a fragment shader for the last step.

Does anyone have ideas for improving the memory fetches, or some other way to solve this nasty problem?
I’ve already thought about using shared memory, but that would leave a whole lot of threads idle.

Basic implementation:
http://code.google.com/p/random-bits/sourc…cudaSAT/main.cu

(other ‘useful’ files are in the root directory)
http://code.google.com/p/random-bits/sourc…or/src/cudaSAT/