I’ve been working on convolution for the last 3 months, and from my experience there’s a very delicate balance between
- reducing #loads / improving reuse
- memory locality (coalescing)
- concurrency
when doing direct convolution on current GPUs. It’s not hard to get 2 of the above, but getting all 3 is quite difficult.
The filter sizes I deal with are 129 x 47 (yes, I also use FFT and separable convolution), so even with register blocking, I definitely still need to use shared RAM as a cache for global RAM to get better bandwidth.
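To make the shared-RAM-as-cache idea concrete, here’s a minimal sketch of a tiled direct convolution. TILE, RADIUS, the 16x16 thread block, the constant-memory filter, and the clamp-to-edge borders are all illustrative assumptions on my part, not a description of my actual kernel:

#define TILE   16
#define RADIUS 3   // e.g. a 7x7 filter

__constant__ float d_filter[(2 * RADIUS + 1) * (2 * RADIUS + 1)];

// Launch with dim3(TILE, TILE) threads per block.
__global__ void conv2dShared(const float *in, float *out, int w, int h)
{
    __shared__ float smem[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

    // Stage the tile plus its halo into shared RAM; each thread may load
    // more than one element because the halo is larger than the tile.
    for (int y = threadIdx.y; y < TILE + 2 * RADIUS; y += TILE)
        for (int x = threadIdx.x; x < TILE + 2 * RADIUS; x += TILE) {
            int ix = min(max((int)(blockIdx.x * TILE + x) - RADIUS, 0), w - 1);
            int iy = min(max((int)(blockIdx.y * TILE + y) - RADIUS, 0), h - 1);
            smem[y][x] = in[iy * w + ix];
        }
    __syncthreads();

    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;
    if (gx >= w || gy >= h) return;

    // Correlation form (no filter flip), as most GPU convolution
    // kernels actually compute it.
    float acc = 0.0f;
    for (int fy = 0; fy < 2 * RADIUS + 1; ++fy)
        for (int fx = 0; fx < 2 * RADIUS + 1; ++fx)
            acc += smem[threadIdx.y + fy][threadIdx.x + fx]
                 * d_filter[fy * (2 * RADIUS + 1) + fx];
    out[gy * w + gx] = acc;
}

Each input element gets read from global RAM only once per tile whose halo contains it, instead of up to 49 times for a 7x7 filter; that’s where the bandwidth saving comes from.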
For a 7x7 filter, the #global loads might not increase dramatically if you don’t use shared RAM.
Not using shared RAM can even have an advantage, because it increases concurrency. I often found that when using register blocking, shared RAM is so small that I have to reduce the #threads (concurrency) so that everything fits. This limitation should be gone with Fermi, which will allow 48 KiB of SMEM.
Since each sample in your case is only 1 bit, the space issue shouldn’t be a problem.
For convolving an 8x8 block of outputs with an 8x8 block of the filter, you don’t need to load the corresponding 15x15 input block all at once. There’s a clever way to load only 8 elements per iteration after the 0th, by working your way around the 15x15 block in a spiral. This saves some space, but the code becomes a monstrosity due to all the hard-coded operations.
For a 1D filter, it’s possible to achieve only 1 load/element:
image:  i0 i1 i2 i3 i4 i5 i6 ...
filter: f0 f1 f2 f3

registers initially: i0 i1 i2 *
after iteration 0:   i0 i1 i2 i3
after iteration 1:   i4 i1 i2 i3
after iteration 2:   i4 i5 i2 i3
after iteration 3:   i4 i5 i6 i3
This can be easily extended to 2D.
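A hedged sketch of what that rotation could look like in CUDA for the length-4 case above: the 4 registers act as a circular buffer, and the loop is unrolled 4x so the rotation becomes a mere change of operand order. Per-thread contiguous strips, taps in constant memory, and the omitted tail handling are my simplifying assumptions (and coalescing of these strip loads is ignored, so the rotation pattern stays visible):

__constant__ float f[4];   // f0 f1 f2 f3

__global__ void conv1dRegisterRotate(const float *in, float *out, int n)
{
    // Hypothetical layout: each thread owns a contiguous strip of n samples.
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * n;

    // "i0 i1 i2 *" -- three registers primed, one slot still empty.
    float r0 = in[base + 0];
    float r1 = in[base + 1];
    float r2 = in[base + 2];
    float r3;

    // Each step issues exactly one new load into the slot whose value is
    // no longer needed -- iterations 0..3 from the diagram above.
    for (int j = 0; j + 6 < n; j += 4) {
        r3 = in[base + j + 3];
        out[base + j]     = f[0]*r0 + f[1]*r1 + f[2]*r2 + f[3]*r3;
        r0 = in[base + j + 4];
        out[base + j + 1] = f[0]*r1 + f[1]*r2 + f[2]*r3 + f[3]*r0;
        r1 = in[base + j + 5];
        out[base + j + 2] = f[0]*r2 + f[1]*r3 + f[2]*r0 + f[3]*r1;
        r2 = in[base + j + 6];
        out[base + j + 3] = f[0]*r3 + f[1]*r0 + f[2]*r1 + f[3]*r2;
    }
    // (The last few outputs near the end of the strip are omitted for brevity.)
}

The unrolling matters: if the 4 values sat in a dynamically indexed local array instead of named registers, the compiler would likely spill them to local memory and the 1-load-per-element property would be lost.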
One question I have is: how do you convert your presumably 1 bit/pixel input into a representation the GPU can do multiply-adds on? I don’t know of any instructions like SSE’s unpack, nor any population-count instructions.
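Thinking about it, plain shifts and masks would probably do for the unpack; a minimal sketch, assuming the pixels are packed 32 per unsigned int with the lowest bit first (the layout is a guess on my part):

__device__ float bitPixel(const unsigned int *packed, int idx)
{
    unsigned int word = packed[idx >> 5];       // the word holding pixel idx
    return (float)((word >> (idx & 31)) & 1u);  // unpack one bit -> 0.0f or 1.0f
}

And since each sample is 0 or 1, the multiply could even be replaced by a predicated add of the filter tap.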