My aim is to split a 1280x960 image of ints into small 4x4 regions and find the maximum pixel value in each region (with a margin around the edge of the overall image). There is a second processing stage on each 4x4 region, but I'll omit that for now. I hoped CUDA (running on a Jetson TX1) would help speed this up…
My initial design is as follows:
Split the image up into a grid of 16x30 overlapping blocks of 96x38 pixels. So 480 blocks in total – each processing 96x38 = 3648 pixels.
Then each block uses 20x8 = 160 threads to load its 3648 pixels into shared memory – 3648 x 4 bytes = 14592 bytes per block. The block is then split into 160 4x4 regions, so each thread finds the maximum in its own 4x4 region.
I thought this would work well because the memory loads can be coalesced (96 is a multiple of 32), all 160 threads are kept busy for both the loading and the reduction with hardly any sitting idle, and the total thread count is a multiple of 32.
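To make the intent concrete, here is a rough CPU-side sketch of the per-region reduction my reference version performs (ignoring the margin and overlap for simplicity). The function name and layout are just illustrative, not my actual code:

```cpp
#include <vector>
#include <algorithm>
#include <cstdint>

// Illustrative reference routine: for each 4x4 region of a w x h image
// (w and h assumed to be multiples of 4, row-major storage), write the
// region's maximum into a (w/4) x (h/4) output grid.
std::vector<int32_t> max_per_4x4(const std::vector<int32_t>& img, int w, int h)
{
    const int rw = w / 4, rh = h / 4;           // region-grid dimensions
    std::vector<int32_t> out(rw * rh);
    for (int ry = 0; ry < rh; ++ry) {
        for (int rx = 0; rx < rw; ++rx) {
            int32_t m = img[ry * 4 * w + rx * 4];
            for (int y = 0; y < 4; ++y)          // scan the 4x4 region
                for (int x = 0; x < 4; ++x)
                    m = std::max(m, img[(ry * 4 + y) * w + (rx * 4 + x)]);
            out[ry * rw + rx] = m;
        }
    }
    return out;
}
```

In the CUDA version, each thread effectively runs the inner two loops for one region, reading from shared memory instead of global memory.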
I have this design running, and a reference CPU version running as well. Unfortunately, the CPU version is twice as fast as the GPU version. A bit of profiling reports low occupancy, which may be down to using too much shared memory per block. Could anyone perhaps elaborate on this, or suggest a better design?
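A back-of-envelope check seems to confirm the shared-memory theory. The SM limits below are my assumptions for the TX1's Maxwell SM (64 KB shared memory, 2048 resident threads) – worth double-checking against the CUDA occupancy calculator:

```cpp
#include <algorithm>

// Simplified occupancy estimate: only the shared-memory and resident-thread
// limits are modelled (register pressure and the max-blocks-per-SM cap are
// ignored). SM figures are my assumptions for the TX1 (Maxwell, sm_53).
struct Occupancy { int blocks_per_sm; double occupancy; };

Occupancy estimate(int threads_per_block, int smem_per_block)
{
    const int kSmemPerSm    = 64 * 1024; // bytes of shared memory per SM
    const int kThreadsPerSm = 2048;      // max resident threads per SM
    int by_smem    = kSmemPerSm / smem_per_block;
    int by_threads = kThreadsPerSm / threads_per_block;
    int blocks     = std::min(by_smem, by_threads);
    return { blocks, blocks * threads_per_block / double(kThreadsPerSm) };
}
// estimate(160, 14592) -> 4 blocks per SM, 0.3125 theoretical occupancy
```

With 14592 bytes per block, only 4 blocks (640 threads) fit on an SM, so the shared-memory limit bites long before the thread limit (which would allow 12 blocks) – roughly 31% theoretical occupancy.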
Many thanks in advance.