Apron setup for square block

I have a float array that I want to load into shared memory. It is almost like the convolutionSeparable example, with the difference that I can't do it in two passes and I also only need one neighbor.

        -> [z z z z z]

[x x x] -> [z x x x z]

[x x x] -> [z x x x z]

[x x x] -> [z x x x z]

        -> [z z z z z]

The float array is [x x x x x x x x x]

and the resulting array should look like [z z z z z z x x x z z x x x z z x x x z z z z z z]

Where z should be values from neighboring blocks (if they exist). So this setup is comparable to a Gaussian filter done in one pass and with only one neighbor.

The input array can of course vary in size, as can the block size, so that must be considered too.

How would you guys do this in an efficient way?
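Since no code was posted, here is a minimal sketch of one way the load could look, assuming a square block of BLOCK×BLOCK threads and (for now) clamped/repeated border values. All names are hypothetical, not anyone's actual implementation:

```cuda
#include <cuda_runtime.h>

#define BLOCK 16   // hypothetical block size

// Each block stages its BLOCK x BLOCK tile plus a one-element apron into
// shared memory. Out-of-range reads clamp to the edge, i.e. border values
// are repeated.
__global__ void apronLoad(const float *in, float *out, int w, int h)
{
    __shared__ float tile[BLOCK + 2][BLOCK + 2];

    // Cover the (BLOCK+2)^2 smem tile with BLOCK^2 threads: threads near the
    // tile border do a second load for the apron rows/columns.
    for (int sy = threadIdx.y; sy < BLOCK + 2; sy += BLOCK)
        for (int sx = threadIdx.x; sx < BLOCK + 2; sx += BLOCK) {
            int x = blockIdx.x * BLOCK + sx - 1;  // -1 shifts for the apron
            int y = blockIdx.y * BLOCK + sy - 1;
            x = min(max(x, 0), w - 1);            // clamp: repeat border value
            y = min(max(y, 0), h - 1);
            tile[sy][sx] = in[y * w + x];
        }
    __syncthreads();

    int gx = blockIdx.x * BLOCK + threadIdx.x;
    int gy = blockIdx.y * BLOCK + threadIdx.y;
    if (gx < w && gy < h) {
        // This thread's element is tile[threadIdx.y + 1][threadIdx.x + 1];
        // the apron neighbors sit at +-1 around it. Pass-through shown here.
        out[gy * w + gx] = tile[threadIdx.y + 1][threadIdx.x + 1];
    }
}
```

The strided loop keeps every thread's first load coalesced while letting the same BLOCK² threads fill the (BLOCK+2)² shared array without extra branching per warp.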

It’s an interesting question. I don’t know, but here’s a thing to consider:

What about edge cases? Is it OK to repeat off-edge values, or drop them, or do you have to compensate at the edges, because of a normalization step for instance? Branching can affect these kernels quite a bit.

For edges it would be preferable if border values were repeated, but setting them to 0 would work too. I guess it would be good to re-normalize (if border values are present), but that can be ignored for now.

Solved it, but it didn't give any significant speed-up compared to global memory. I then implemented a hybrid method using horizontal lookups in shared memory and global memory for the vertical ones, which gave the best performance.
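The hybrid wasn't posted, but a sketch of what such a scheme could look like, assuming one row per block and a 5-point (cross) stencil for illustration (all names hypothetical; the diagonal z values would each need two more gmem reads):

```cuda
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block; hypothetical value

// Hybrid sketch: each block stages one row of the input, plus a left/right
// apron, in shared memory; the vertical neighbors are read straight from
// global memory. Clamping repeats border values at the edges.
__global__ void hybridPass(const float *in, float *out, int w, int h)
{
    __shared__ float row[BLOCK + 2];

    int gx = blockIdx.x * BLOCK + threadIdx.x;
    int gy = blockIdx.y;   // one row per block in y

    // Stage the row with a one-element apron on each side.
    int x = min(max(gx - 1, 0), w - 1);
    row[threadIdx.x] = in[gy * w + x];
    if (threadIdx.x < 2) {
        int xa = min(gx - 1 + BLOCK, w - 1);
        row[threadIdx.x + BLOCK] = in[gy * w + xa];
    }
    __syncthreads();

    if (gx < w) {
        float left   = row[threadIdx.x];               // horizontal from smem
        float center = row[threadIdx.x + 1];
        float right  = row[threadIdx.x + 2];
        float up   = in[max(gy - 1, 0)     * w + gx];  // vertical from gmem
        float down = in[min(gy + 1, h - 1) * w + gx];
        out[gy * w + gx] = 0.2f * (left + center + right + up + down);
    }
}
```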

That’s pretty interesting, thanks for posting.

Is coalescing of gmem reads the issue?

If so, try placing your input into a cudaArray and then fetch the array as well as apron values into smem from texture. This should give pretty good throughput as you’ll have good locality.

I would also use a 2D smem array. It’ll simplify your indexing code and in my experience the compiler does a good job generating efficient code for 2D smem array accesses.
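To make the suggestion concrete, here is a hedged sketch of that approach using the texture reference API of the era (current CUDA uses texture objects instead); all names are hypothetical. A nice side effect is that cudaAddressModeClamp makes out-of-range fetches repeat the border value automatically, so the apron load needs no edge branching:

```cuda
#include <cuda_runtime.h>

#define BLOCK 16   // hypothetical block size

// 2D texture bound to a cudaArray holding the input.
texture<float, 2, cudaReadModeElementType> texIn;

__global__ void textureApron(float *out, int w, int h)
{
    __shared__ float tile[BLOCK + 2][BLOCK + 2];

    // Fetch the tile plus apron from texture into a 2D smem array. The -0.5f
    // combines the apron shift (-1) with unnormalized texel addressing
    // (element x sits at coordinate x + 0.5).
    for (int sy = threadIdx.y; sy < BLOCK + 2; sy += BLOCK)
        for (int sx = threadIdx.x; sx < BLOCK + 2; sx += BLOCK)
            tile[sy][sx] = tex2D(texIn,
                                 blockIdx.x * BLOCK + sx - 0.5f,
                                 blockIdx.y * BLOCK + sy - 0.5f);
    __syncthreads();

    int gx = blockIdx.x * BLOCK + threadIdx.x;
    int gy = blockIdx.y * BLOCK + threadIdx.y;
    if (gx < w && gy < h)
        out[gy * w + gx] = tile[threadIdx.y + 1][threadIdx.x + 1]; // + neighbors
}

// Host-side setup sketch: allocate the cudaArray, copy the input in, and
// bind the texture with clamped addressing in both dimensions.
void setup(cudaArray **arr, const float *h_in, int w, int h)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(arr, &desc, w, h);
    cudaMemcpyToArray(*arr, 0, 0, h_in, w * h * sizeof(float),
                      cudaMemcpyHostToDevice);
    texIn.addressMode[0] = cudaAddressModeClamp;
    texIn.addressMode[1] = cudaAddressModeClamp;
    cudaBindTextureToArray(texIn, *arr);
}
```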


Thanks paulius for the tip!

I now use a cudaArray instead of gmem as before. I haven't tried adding that to smem with the apron yet, but just switching to a cudaArray and texture memory doubled the speed compared to global only. Really nice! Since I do several passes I have to copy the output back to the array after every pass, but that doesn't seem to have too much overhead.
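Presumably the per-pass copy looks something like this: a device-to-device transfer from the kernel's linear output buffer back into the cudaArray, so the next pass can fetch it through the texture. (cudaMemcpyToArray was the call available at the time; it has since been deprecated in favor of cudaMemcpy2DToArray. Names hypothetical.)

```cuda
#include <cuda_runtime.h>

// Copy a pass's output back into the cudaArray bound to the input texture.
// Device-to-device, so the cost is just copy bandwidth.
void copyBack(cudaArray *arr, const float *d_out, int w, int h)
{
    cudaMemcpyToArray(arr, 0, 0, d_out, w * h * sizeof(float),
                      cudaMemcpyDeviceToDevice);
}
```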

I did use 2D smem in the previous implementation, which worked well. But since I didn't get much of a speed-up last time, I don't expect it to help now either. There are simply too few lookups, I guess.

Thanks again!