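Have each thread read 4 bytes (one 32-bit word) into shared memory. The first thread reads RGBR, the next reads GBRG, and the third reads BRGB. The pattern then repeats every three threads, since 12 bytes hold exactly four pixels. Consecutive threads read consecutive words, so these are coalesced reads. Now sync (__syncthreads() in CUDA).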
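Once the data is in shared memory, you can access the individual RGB values much faster than you can through global memory. Do your computation there, then reverse the process when writing back to global memory: sync again and have each thread write 4 bytes from shared memory, so the writes are coalesced too.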
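
The steps above could look roughly like the sketch below. It is only an illustration under several assumptions that are not in the original answer: the image is interleaved 8-bit RGB, the pixel count is a multiple of the block size (so there is no bounds handling), the per-pixel "computation" is a simple red/blue swap, and the names swapRedBlue, PIXELS_PER_BLOCK and WORDS_PER_BLOCK are made up for the example. With 256 pixels per block the tile is 768 bytes, so only the first 192 threads take part in the word-sized load and store.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical sizes: one pixel per thread, 256 pixels = 768 bytes = 192 words.
constexpr int PIXELS_PER_BLOCK = 256;
constexpr int WORDS_PER_BLOCK  = PIXELS_PER_BLOCK * 3 / 4;

// Assumes the image holds a whole number of 256-pixel tiles.
__global__ void swapRedBlue(unsigned int *image)
{
    __shared__ unsigned int tile[WORDS_PER_BLOCK];
    unsigned int *blockWords = image + (size_t)blockIdx.x * WORDS_PER_BLOCK;

    // Coalesced load: consecutive threads read consecutive 4-byte words
    // (RGBR, GBRG, BRGB, ...).
    if (threadIdx.x < WORDS_PER_BLOCK)
        tile[threadIdx.x] = blockWords[threadIdx.x];
    __syncthreads();

    // Per-pixel work on shared memory: each thread touches only its own 3 bytes.
    unsigned char *px = reinterpret_cast<unsigned char *>(tile) + 3 * threadIdx.x;
    unsigned char r = px[0];
    px[0] = px[2];
    px[2] = r;
    __syncthreads();

    // Reverse the process: coalesced word-sized store back to global memory.
    if (threadIdx.x < WORDS_PER_BLOCK)
        blockWords[threadIdx.x] = tile[threadIdx.x];
}

int main()
{
    const int numPixels = 4 * PIXELS_PER_BLOCK;
    const size_t numBytes = 3 * (size_t)numPixels;

    unsigned char *input  = (unsigned char *)malloc(numBytes);
    unsigned char *output = (unsigned char *)malloc(numBytes);
    for (size_t i = 0; i < numBytes; ++i)
        input[i] = (unsigned char)(i * 7);  // arbitrary test pattern

    unsigned int *dev = nullptr;
    cudaMalloc((void **)&dev, numBytes);
    cudaMemcpy(dev, input, numBytes, cudaMemcpyHostToDevice);

    swapRedBlue<<<numPixels / PIXELS_PER_BLOCK, PIXELS_PER_BLOCK>>>(dev);
    cudaMemcpy(output, dev, numBytes, cudaMemcpyDeviceToHost);

    // Check on the host that R and B were exchanged and G was left alone.
    int errors = 0;
    for (int p = 0; p < numPixels; ++p) {
        if (output[3 * p]     != input[3 * p + 2] ||
            output[3 * p + 1] != input[3 * p + 1] ||
            output[3 * p + 2] != input[3 * p])
            ++errors;
    }
    printf("%s\n", errors == 0 ? "OK" : "MISMATCH");

    cudaFree(dev);
    free(input);
    free(output);
    return errors != 0;
}
```

Staging through shared memory matters here because a thread that read its own 3-byte pixel straight from global memory would issue misaligned byte accesses. The word-sized load and store keep the global traffic aligned, and the byte-level indexing is confined to shared memory.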