newbie: Host to GPU overhead


I am new to CUDA programming and as a first project I made a very simple kernel performing color2gray on an image. My problem is that although the GPU version is faster than the CPU version, it is only so if you do not take the transfer from host memory to video memory into account. Running the conversion 1000x took 0.18 ms on the CPU and 0.12 ms on the GPU. But if I also count the memory copies, the GPU version takes 2.5 ms. That is more than 10x slower than the CPU version!

In my real application I will need the result of CUDA operations on the host side. If I cannot avoid the overhead CUDA will not work for me. So my question is:

Is there a way to avoid this overhead ?

Thanks in advance,



Unfortunately, memory transfer kills you when dealing with very small amounts of data. But you can improve your code by using pinned host memory, which is allocated with cudaMallocHost, and you can try asynchronous kernel launches and memory transfers using streams. That way idle times on the GPU may be avoided and the slow memory transfers can be hidden (this works only if your GPU supports "device overlap").
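A minimal sketch of what that can look like (buffer names, sizes, and the grid/block variables are just placeholders, and error checking is omitted for brevity):

```cuda
// Sketch: pinned host memory plus an asynchronous copy/kernel/copy
// pipeline on a stream. Assumes kernel() has the signature discussed
// in this thread; height, srcStep, dstStep, grid, block are placeholders.
unsigned char *h_src, *h_dst;        // pinned (page-locked) host buffers
unsigned char *d_src, *d_dst;        // device buffers
size_t srcBytes = height * srcStep;
size_t dstBytes = height * dstStep;

cudaMallocHost((void**)&h_src, srcBytes);   // pinned allocation
cudaMallocHost((void**)&h_dst, dstBytes);
cudaMalloc((void**)&d_src, srcBytes);
cudaMalloc((void**)&d_dst, dstBytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

// All three operations are queued on the stream and return immediately;
// copies from pinned memory can overlap with kernel execution on GPUs
// that support device overlap.
cudaMemcpyAsync(d_src, h_src, srcBytes, cudaMemcpyHostToDevice, stream);
kernel<<<grid, block, 0, stream>>>(d_src, d_dst, srcStep, dstStep);
cudaMemcpyAsync(h_dst, d_dst, dstBytes, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);       // wait for the result before using h_dst
```

cudaMemcpyAsync only runs asynchronously with respect to the host when the host buffer is pinned; with ordinary pageable memory it falls back to a synchronous copy.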

By the way: is your kernel code optimized for execution on the GPU (especially concerning coalesced memory access)? 0.12 ms versus 0.18 ms is a very weak speedup.



Hi Navier,

Thank you for your answer. I will try your suggestions ( using pinned memory, streams ). I am certain that my kernel is not optimized, it is as basic as it can get, almost straight from the programming guide. Here it is:

__global__ void kernel( unsigned char* src_data, unsigned char* dst_data, int srcStep, int dstStep )
{
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    int o = srcStep * y + 3 * x;

    unsigned char val = ( src_data[o] + src_data[o+1] + src_data[o+2] ) / 3;

    dst_data[dstStep * y + x] = val;
}
If this can be improved, I’d certainly like to learn how.



Have a look at the CUDA Training materials for general CUDA issues.

Apart from coalescing your code is quite good. What will happen if you use larger amounts of data? What are your grid- and block-dimensions?
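On the coalescing point: with 3-byte RGB pixels, neighbouring threads read at 3-byte offsets, which the hardware cannot combine into one memory transaction. One common workaround, sketched here under the assumption that you can pad the image to 4 bytes per pixel (RGBA), is to load one aligned uchar4 per thread so each half-warp reads one contiguous block:

```cuda
// Sketch: coalesced variant assuming the source is padded to RGBA.
// srcPitch and dstPitch are in elements, not bytes.
__global__ void kernel_rgba( const uchar4* src_data, unsigned char* dst_data,
                             int srcPitch, int dstPitch )
{
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    uchar4 p = src_data[y * srcPitch + x];       // one coalesced 4-byte load
    dst_data[y * dstPitch + x] = (p.x + p.y + p.z) / 3;
}
```

The extra 25% of data transferred is usually cheaper than the scattered 3-byte reads it replaces.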

Hi Navier,

Wow. That worked! Thanks. I used pinned memory and streams and now the GPU version runs at 0.028 ms.

Both the blocksize and the gridsize are 16x16, but for the real application this could vary as the filter should work for arbitrary size images. I had a quick look at the training, but this coalescing still seems rather fuzzy to me (I am afraid it has been a long time since I had to think about memory layout…).
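Coalescing in a nutshell: when the 16 threads of a half-warp access 16 consecutive, properly aligned words, the hardware merges them into a single memory transaction; scattered or strided accesses are split into many transactions. A toy illustration (kernel name and the stride value are just examples):

```cuda
// Sketch: the same copy written two ways. Thread k of a half-warp
// reading word k is coalesced; reading word k*stride is not.
__global__ void copy_example( const float* in, float* out, int stride )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    out[i] = in[i];             // coalesced: one transaction per half-warp
    // out[i] = in[i * stride]; // NOT coalesced for stride > 1: neighbouring
                                // threads hit far-apart addresses
}
```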