Hello. I am implementing image processing in CUDA, and sometimes I have to deal with color images in RGB format.
So I upload an image in RGB format to the GPU and read each pixel using uchar3 pixel = ((uchar3*) (ptr + pitch * y))[x]; I write to the output in the same way. I call many kernels, and the output of one is the input to the next.
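To make the first approach concrete, here is a minimal sketch of how I read and write uchar3 pixels through pitched memory. The kernel name invertRgb, the helper invertRgbOnGpu, the 16x16 block size and the channel inversion itself are only placeholders for illustration, not my real kernels; the sketch assumes the pitch is given in bytes, as returned by cudaMallocPitch.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: reads an RGB pixel, inverts each channel, writes it back out.
__global__ void invertRgb(const unsigned char* src, size_t srcPitch,
                          unsigned char* dst, size_t dstPitch,
                          int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // The pitch is in bytes: step the byte pointer to row y first,
    // then index that row as uchar3 elements.
    const uchar3* srcRow = (const uchar3*)(src + srcPitch * y);
    uchar3* dstRow = (uchar3*)(dst + dstPitch * y);

    uchar3 p = srcRow[x];
    dstRow[x] = make_uchar3(255 - p.x, 255 - p.y, 255 - p.z);
}

// Hypothetical host helper: runs the kernel on a tightly packed RGB host buffer.
// Returns true if every CUDA call succeeded.
bool invertRgbOnGpu(const unsigned char* hostIn, unsigned char* hostOut,
                    int width, int height)
{
    unsigned char* dIn = 0;
    unsigned char* dOut = 0;
    size_t inPitch = 0, outPitch = 0;
    size_t rowBytes = (size_t)width * sizeof(uchar3);

    if (cudaMallocPitch((void**)&dIn, &inPitch, rowBytes, height) != cudaSuccess)
        return false;
    if (cudaMallocPitch((void**)&dOut, &outPitch, rowBytes, height) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }

    cudaError_t err = cudaMemcpy2D(dIn, inPitch, hostIn, rowBytes,
                                   rowBytes, height, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        invertRgb<<<grid, block>>>(dIn, inPitch, dOut, outPitch, width, height);
        err = cudaGetLastError();  // catches launch configuration errors
    }
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy2D(hostOut, rowBytes, dOut, outPitch,
                           rowBytes, height, cudaMemcpyDeviceToHost);
    }

    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess;
}
```

Each thread here touches exactly 3 bytes on the read and 3 bytes on the write, which is the access pattern I am asking about.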
Another way is to convert the image to RGBA format before processing and work with that.
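The second approach would need a one-time conversion step, roughly like the sketch below. The names rgbToRgba and rgbToRgbaOnGpu are hypothetical, and filling alpha with 255 is just an assumption; the sketch also assumes the RGBA buffer comes from cudaMallocPitch so that every row start is suitably aligned for uchar4 accesses.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: expands packed RGB (3 bytes/pixel) to RGBA (4 bytes/pixel).
__global__ void rgbToRgba(const unsigned char* src, size_t srcPitch,
                          unsigned char* dst, size_t dstPitch,
                          int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // uchar3 has 1-byte alignment, so any byte offset is a valid address for it.
    uchar3 p = ((const uchar3*)(src + srcPitch * y))[x];

    // uchar4 needs 4-byte alignment, so each pixel is written with one 32-bit store.
    // Alpha is set to 255 here (an assumption; any padding value would do).
    ((uchar4*)(dst + dstPitch * y))[x] = make_uchar4(p.x, p.y, p.z, 255);
}

// Hypothetical host helper: converts a tightly packed RGB host buffer
// into a tightly packed RGBA host buffer via the GPU.
bool rgbToRgbaOnGpu(const unsigned char* hostRgb, unsigned char* hostRgba,
                    int width, int height)
{
    unsigned char* dRgb = 0;
    unsigned char* dRgba = 0;
    size_t rgbPitch = 0, rgbaPitch = 0;
    size_t rgbRow = (size_t)width * sizeof(uchar3);
    size_t rgbaRow = (size_t)width * sizeof(uchar4);

    if (cudaMallocPitch((void**)&dRgb, &rgbPitch, rgbRow, height) != cudaSuccess)
        return false;
    if (cudaMallocPitch((void**)&dRgba, &rgbaPitch, rgbaRow, height) != cudaSuccess) {
        cudaFree(dRgb);
        return false;
    }

    cudaError_t err = cudaMemcpy2D(dRgb, rgbPitch, hostRgb, rgbRow,
                                   rgbRow, height, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        rgbToRgba<<<grid, block>>>(dRgb, rgbPitch, dRgba, rgbaPitch, width, height);
        err = cudaGetLastError();  // catches launch configuration errors
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy2D(hostRgba, rgbaRow, dRgba, rgbaPitch,
                           rgbaRow, height, cudaMemcpyDeviceToHost);
    }

    cudaFree(dRgb);
    cudaFree(dRgba);
    return err == cudaSuccess;
}
```

After this step all later kernels would read and write uchar4, so each pixel costs 4 bytes of traffic instead of 3, but every access is a single aligned 32-bit word.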
My application targets Fermi. I think that, thanks to the L1/L2 cache, the first method is faster than the second because less data has to be read from global memory. Am I right? Does the same hold for CC 1.3?