A question regarding speed


I am working on color image processing code. I load the image on the CPU, copy it to the GPU with cudaMemcpyHostToDevice, process it on the GPU, and hand the result to OpenGL to render the output image. The image is loaded into an unsigned char pointer. If the image is WxH, its size is WxHx3 bytes because of the three RGB channels.
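For reference, the upload step described above might look roughly like this (variable names are illustrative and error checking is omitted):

```
// Packed RGB image occupies W*H*3 bytes on the host.
size_t bytes = (size_t)W * H * 3;
unsigned char* d_img = NULL;
cudaMalloc(&d_img, bytes);                              // device buffer
cudaMemcpy(d_img, h_img, bytes, cudaMemcpyHostToDevice); // one bulk copy
```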

My first approach was to convert this unsigned char pointer to a uchar4 pointer on the CPU by padding in a zero alpha channel. I saw many examples in the CUDA SDK doing this, so I did the same and then passed the uchar4 pointer to CUDA. It works great.

I then thought I could improve on this: converting the unsigned char image pointer to a uchar4 pointer means visiting every pixel, and that loop runs on the CPU. So why not pass the unsigned char pointer to the GPU directly? I expected this to be faster, but it turns out to be much slower… Why is that?


Which GPU are you using? Devices of different compute capability have very different coalescing requirements.

You could map the CPU image into the GPU address space and use a kernel to copy the image to device memory, padding it on the fly.
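A sketch of that approach, assuming the host image is allocated as page-locked, mapped (zero-copy) memory so the kernel can read it directly (names are illustrative, error checking omitted):

```
// Each thread pads one pixel from packed RGB into an aligned uchar4.
__global__ void padToRGBA(const unsigned char* __restrict__ rgb,
                          uchar4* __restrict__ rgba, int numPixels)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels)
        rgba[i] = make_uchar4(rgb[3*i], rgb[3*i + 1], rgb[3*i + 2], 0);
}

// Host side:
//   cudaSetDeviceFlags(cudaDeviceMapHost);
//   cudaHostAlloc(&h_rgb, 3 * numPixels, cudaHostAllocMapped);
//   ... load the image into h_rgb ...
//   cudaHostGetDevicePointer(&d_rgbMapped, h_rgb, 0);
//   cudaMalloc(&d_rgba, numPixels * sizeof(uchar4));
//   padToRGBA<<<(numPixels + 255) / 256, 256>>>(d_rgbMapped, d_rgba, numPixels);
```

This moves the per-pixel padding loop to the GPU and overlaps it with the transfer over PCIe, so the CPU never has to touch every pixel.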

I have a Quadro 600. Do you mean that instead of padding the unsigned char data to uchar4 on the CPU, I can write a simple kernel that pads it in parallel on the GPU? That's a good idea!

So you have a compute capability 2.1 device. I’m surprised the kernel runs much slower with unaligned data. Can you post code?