I am working on color image processing code. I load the image on the CPU, copy it to the GPU with cudaMemcpyHostToDevice, process it on the GPU, and hand the result to OpenGL to render the output image. The image is loaded into an unsigned char buffer; for a W×H image the buffer size is W×H×3 bytes, one byte per RGB channel.
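For context, the upload step looks roughly like this (a minimal sketch; the function and variable names are mine, not from my actual code):

```cuda
#include <cuda_runtime.h>

// Upload a W x H RGB image (3 bytes per pixel) to the device.
// h_rgb, W, and H are assumed to come from the image loader.
unsigned char* uploadRGB(const unsigned char* h_rgb, int W, int H)
{
    unsigned char* d_rgb = nullptr;
    size_t bytes = (size_t)W * H * 3;   // W*H pixels, 3 bytes each
    cudaMalloc(&d_rgb, bytes);
    cudaMemcpy(d_rgb, h_rgb, bytes, cudaMemcpyHostToDevice);
    return d_rgb;
}
```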
My first approach was to convert this unsigned char buffer into a uchar4 buffer on the CPU by padding each pixel with a zero alpha channel. Many examples in the CUDA SDK do this, so I did the same and then passed the uchar4 pointer to CUDA. It works great.
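The CPU-side padding pass is essentially this (a sketch in plain C++; I define a local `uchar4` struct here so the snippet stands alone, but in real code it comes from CUDA's `vector_types.h`):

```cpp
#include <cstddef>
#include <vector>

// Stand-in for CUDA's uchar4 from <vector_types.h>.
struct uchar4 { unsigned char x, y, z, w; };

// Expand a W*H*3 RGB buffer into a W*H uchar4 buffer with zero alpha.
std::vector<uchar4> rgbToRGBA(const unsigned char* rgb, int W, int H)
{
    std::vector<uchar4> rgba((size_t)W * H);
    for (size_t i = 0; i < rgba.size(); ++i) {
        rgba[i].x = rgb[3 * i + 0];  // R
        rgba[i].y = rgb[3 * i + 1];  // G
        rgba[i].z = rgb[3 * i + 2];  // B
        rgba[i].w = 0;               // padded alpha channel
    }
    return rgba;
}
```

Note that this loop touches every pixel once on the CPU, which is exactly the overhead I was hoping to remove.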
I originally thought I could improve on this: converting the unsigned char buffer to uchar4 requires a pass over every pixel of the image, and that pass runs on the CPU. So why not pass the unsigned char pointer to the GPU directly and skip the conversion? I expected this to be faster, but it turns out to be much slower. Why is that?
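For reference, here is roughly how the two kernel read patterns differ (illustrative sketches, not my actual kernels; the uchar4 version does one aligned 4-byte load per thread, while the 3-byte version issues three separate 1-byte loads per pixel):

```cuda
#include <cuda_runtime.h>

// uchar4 version: one aligned 4-byte load and store per thread.
__global__ void processRGBA(const uchar4* in, uchar4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uchar4 p = in[i];
        out[i] = p;  // ...actual per-pixel processing goes here...
    }
}

// unsigned char version: three separate 1-byte loads and stores per pixel.
__global__ void processRGB(const unsigned char* in, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned char r = in[3 * i + 0];
        unsigned char g = in[3 * i + 1];
        unsigned char b = in[3 * i + 2];
        out[3 * i + 0] = r;  // ...actual per-pixel processing goes here...
        out[3 * i + 1] = g;
        out[3 * i + 2] = b;
    }
}
```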