Best access patterns for 8bit data on Compute 1.0/1.1 hardware

I’m currently in the midst of optimizing various kernels that we run on 8bit gray scale images, most of which I’ve coalesced by reading/writing 4 pixels per thread (one 32bit word)…

However, we have a kernel which copies 1 channel from a 24bit RGB channel image to an 8bit gray scale image and I’m currently having trouble figuring out a ‘clean’/efficient way of coalescing the memory loads/stores from the 24bit RGB image.

Has anyone had to deal with a similar case before? and if so, how did you approach this problem?

You can’t avoid loading the extra 8 bit values, since the hardware accesses the memory by word.

So load an entire stripe of RGB, still packed, into your warp, then sort it out in shared memory. The crucial part is to make the

global memory read aligned and coalesced. The overhead of any rearrangement later is negligable.

unsigned char *RGB=(some pointer in global memory. Assume it's aligned to a 16 word boundary) 

unsigned long *longpntr=(unsigned long *)RGB; // we'll access it by words

unsigned long data=longpntr[threadidx.x]; // coalesced global read of 32 words = 128 bytes

__shared__ unsigned char redChannnel[64];

// a thread holds one word, which has a different phase, and may be RGBR  GBRG  BRGB. Pick the right one.

// these conversions are easy but it's also easy to get off-by-one problems..

int index=1+(4*threadidx.x)/3; // Where my LAST R will be saved (I may hold 2)

int phase=threadidx.x%3;  // 0=RGBR  1=GBRG  2=BRGB

redChannel[index]=0xFF & (data>>(phase*8));

if (0==phase) { // I hold 2, need to dump the extra one

  redChannel[index-1]=data>>24;

}

The above code is completely untested, I just typed it here in the browser, but you get the idea.

The strategy is simple but watch out for those off-by-one errors in stuff like that index= line.

Thanks for your example :)

Following from the basic principle of your example code, I managed to completely coalesce the reads (4 RGB pixels per thread, totaling 3 32bit words) into smem, unpack them in the thread, and write them back out with coalescing (again, writing 4 pixels per thread).

My original kernel took approximately 350us (due to the fact it wasn’t optimized for RGB->GRAY8 copies, I was using a generic byte-wise interlaced copying kernel), this new kernel now takes 40us - so that’s an 8.75x speedup in general for that case. :)
(Better yet, I run this kernel 80+ times a second (and growing), so that’s a significant speed up there)

Good thread.