Discussion: Registers or SHM (e.g. for image convolution)

Hi!

Imagine a 7x7 image convolution.

To implement this efficiently, the obvious first step is to avoid one independent thread per pixel (= 49 pixel loads per output pixel).

a better way: load a 14x14 input block for each 8x8 output block (= 196/64 ≈ 3.06 loads/pixel)
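The two load counts above can be checked with a small helper. This is only a CPU-side C++ sketch of the arithmetic; the function name `loads_per_pixel` and its parameters `tile` and `k` are made up for illustration.

```cpp
#include <cassert>

// Input pixels loaded per output pixel when a (tile x tile) block of outputs
// is computed from one (tile + k - 1) x (tile + k - 1) input block.
// `tile` and `k` are illustrative names, not taken from the post.
double loads_per_pixel(int tile, int k) {
    const int edge = tile + k - 1;  // edge length of the input block incl. halo
    return static_cast<double>(edge * edge) / (tile * tile);
}
```

With `tile = 1, k = 7` this gives the 49 loads/pixel of the naive version, and with `tile = 8, k = 7` it gives 196/64 = 3.0625.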

one thing: the convolution result is binary, just 0 or 1. so 8 pixels are packed into one byte and therefore one thread should process at least 8 pixels.
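A minimal C++ sketch of that packing step, for reference. The bit order (first pixel in the least significant bit) is an assumption, and `pack8` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Pack 8 binary results (each 0 or 1) into one byte.
// Assumed bit order: result[0] ends up in the least significant bit.
std::uint8_t pack8(const std::uint8_t result[8]) {
    std::uint8_t byte = 0;
    for (int i = 0; i < 8; ++i) {
        byte |= static_cast<std::uint8_t>((result[i] & 1u) << i);
    }
    return byte;
}
```

Since a byte is the smallest unit a thread can write without read-modify-write, this is why each thread should produce all 8 bits of its output byte itself.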

would you use registers or shared memory?

register implementation:

no shm is used. the 14x14 block is loaded into registers, so one thread processes 8x8 pixels and produces 8 bytes.

=> heavy threads, but fewer blocks

shm implementation:

the 14x14 block is loaded into shm. one thread processes 8 pixels and produces 1 byte.

=> leaner threads, but more of them
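To put numbers on "heavy vs. lean", here is a rough C++ sketch that counts threads for both layouts. It assumes the image width and height are multiples of 8; `ThreadCounts` and `count_threads` are hypothetical names.

```cpp
#include <cassert>

// Thread counts for the two layouts. Assumes width and height are
// multiples of 8 (a simplification made for this sketch).
struct ThreadCounts {
    long register_variant;  // one thread per 8x8 output block (8 bytes each)
    long shm_variant;       // one thread per 8 output pixels (1 byte each)
};

ThreadCounts count_threads(long width, long height) {
    return { (width / 8) * (height / 8), (width / 8) * height };
}
```

For a 1024x1024 image that is 16384 threads in the register layout against 131072 in the shm layout, i.e. a factor of 8.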

or would u try a completely different layout?

greets,

moik

I’ve been working on convolution for the last 3 months and from my experience, there’s a very delicate balance between

  1. concurrency (occupancy)

  2. reducing #loads / improving locality

  3. memory locality (coalescing)

when doing direct convolution on current GPUs. It's not hard to get 2 of the above, but getting all 3 is quite difficult.

The filter sizes I deal with are 129 x 47 (yes, I also use FFT and separable convolution), so even with register blocking I definitely still need to use shared RAM as a cache for global RAM to get better bandwidth.

For a 7x7 filter, the #global loads might not increase dramatically if you don’t use shared RAM.

Not using shared RAM might have an advantage because it can increase concurrency. I have often found that, with register blocking, shared RAM is so small that I have to reduce the #threads (concurrency) to make everything fit. This limitation should be gone with Fermi, which will allow 48KiB SMEM.

Since each sample in your case is only 1bit, the space issue shouldn’t be a problem.

To convolve an 8x8 block with the filter, you don’t need to load the whole 14x14 block all at once. There’s a clever way to load only 8 elements in each iteration after the 0th, by working your way around the 14x14 block in a spiral. This would save some space, but the code becomes a monstrosity due to all the hard-coded operations.

For a 1D filter, it’s possible to achieve only 1 load/element:

image:   i0   i1   i2   i3   i4   i5   i6   ...

filter:  f0   f1   f2   f3

reg block:

initial:

i0   i1   i2   *

after iteration 0:

i0   i1   i2   i3

after iteration 1:

i4   i1   i2   i3

after iteration 2:

i4   i5   i2   i3

after iteration 3:

i4   i5   i6   i3

...

This can be easily extended to 2D.
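As a rough illustration of the register block above, here is a CPU-side C++ sketch of the 1D case. The array `reg` stands in for the per-thread registers, and the counter `loads` is only there to show that each image element is read once. The function name, the 4-tap size `K` and the filter orientation (no flip) are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t K = 4;  // filter taps, matching f0..f3 above

// Computes out[n] = sum_j filter[j] * image[n + j], reading every image
// element exactly once. reg[] plays the role of the register block: slot
// (n % K) is overwritten as soon as element n is no longer needed.
std::vector<float> filter1d_ring(const std::vector<float>& image,
                                 const float (&filter)[K],
                                 std::size_t* loads = nullptr) {
    std::vector<float> out;
    if (image.size() < K) return out;
    float reg[K] = {};
    std::size_t n_loads = 0;
    // Initial fill: i0 .. i(K-2); the last slot is still empty ("*").
    for (std::size_t n = 0; n + 1 < K; ++n) {
        reg[n] = image[n];
        ++n_loads;
    }
    for (std::size_t n = 0; n + K <= image.size(); ++n) {
        const std::size_t newest = n + K - 1;
        reg[newest % K] = image[newest];  // the only load in this iteration
        ++n_loads;
        float acc = 0.0f;
        for (std::size_t j = 0; j < K; ++j) {
            acc += filter[j] * reg[(n + j) % K];
        }
        out.push_back(acc);
    }
    if (loads) *loads = n_loads;
    return out;
}
```

On a GPU the `% K` indexing would presumably have to be unrolled by hand so that `reg` really stays in registers, which is where the hard-coded operations come from.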

One question I have: how do you convert your presumably 1-bit/pixel input into a representation the GPU can do multiply-add on? I don’t know of any instructions like SSE’s unpack, or of any population count instruction.
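For what it's worth, both operations can be written with plain shifts and masks. The C++ sketch below shows one way, with hypothetical helper names and an assumed bit order (bit 0 = first pixel). In CUDA device code the `__popc` intrinsic could take the place of the hand-written bit count. The AND-plus-popcount dot product only applies if the filter is binary as well, which the thread does not state.

```cpp
#include <cassert>
#include <cstdint>

// Expand one packed byte into 8 floats (0.0f or 1.0f) so that ordinary
// multiply-add can be used. Assumed convention: bit 0 is the first pixel.
void unpack8(std::uint8_t packed, float out[8]) {
    for (int i = 0; i < 8; ++i) {
        out[i] = static_cast<float>((packed >> i) & 1u);
    }
}

// Portable population count (number of set bits).
int popcount32(std::uint32_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  // clears the lowest set bit
        ++n;
    }
    return n;
}

// Assumption: the filter is binary too. Then a dot product over 32 packed
// pixels reduces to one AND followed by a population count.
int binary_dot(std::uint32_t pixels, std::uint32_t filter_bits) {
    return popcount32(pixels & filter_bits);
}
```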