newbie: Host to GPU overhead

r.odobenus · April 23, 2009, 7:00am

Hello,

I am new to CUDA programming, and as a first project I wrote a very simple kernel that performs color2gray on an image. My problem is that although the GPU version is faster than the CPU version, this is only true if you do not take the transfer from host memory to video memory into account. Running the conversion 1000x took 0.18 ms on the CPU and 0.12 ms on the GPU. But if I also count the memory copies, the GPU version takes 2.5 ms. That is more than 10x slower than the CPU version!
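For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of timing the kernel alone versus the kernel plus both copies, using CUDA events. The `dummy` kernel, the `Timings` struct, and the `time_both` helper are hypothetical stand-ins, not the code that produced the numbers above.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: copies each byte unchanged.
__global__ void dummy(const unsigned char* in, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

struct Timings { float kernel_ms; float total_ms; };

// Times one pass over n bytes: the kernel alone, and copy-in + kernel + copy-out.
Timings time_both(int n)
{
    unsigned char* h = (unsigned char*)calloc(n, 1);
    unsigned char *d_in, *d_out;
    cudaMalloc((void**)&d_in, n);
    cudaMalloc((void**)&d_out, n);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);
    cudaEventCreate(&t3);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(t0, 0);                               // before host-to-device copy
    cudaMemcpy(d_in, h, n, cudaMemcpyHostToDevice);
    cudaEventRecord(t1, 0);                               // before kernel
    dummy<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(t2, 0);                               // after kernel
    cudaMemcpy(h, d_out, n, cudaMemcpyDeviceToHost);
    cudaEventRecord(t3, 0);                               // after device-to-host copy
    cudaEventSynchronize(t3);                             // wait until everything has finished

    Timings t;
    cudaEventElapsedTime(&t.kernel_ms, t1, t2);           // kernel only
    cudaEventElapsedTime(&t.total_ms, t0, t3);            // copies + kernel

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaEventDestroy(t3);
    cudaFree(d_in);
    cudaFree(d_out);
    free(h);
    return t;
}
```

Calling `time_both(256 * 256 * 3)` and printing both fields shows how much of the total is spent in the copies rather than in the kernel.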

In my real application I will need the result of the CUDA operations on the host side. If I cannot avoid this overhead, CUDA will not work for me. So my question is:

Is there a way to avoid this overhead?

Thanks in advance,

Rob

navier-stokes · April 23, 2009, 7:21am

Hi!

Unfortunately, memory transfer kills you when you are dealing with very small amounts of data. But you may be able to improve your code by using pinned host memory, which is allocated with cudaMallocHost, and you may try to use asynchronous kernel launches and memory transfers using streams. That way, GPU idle time may be avoided and slow memory transfers may be hidden (this works only if your GPU supports “device overlap”).
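A minimal sketch of that idea, under stated assumptions: the `invert` kernel, the `run_streamed` helper, and the choice of four chunks are hypothetical and only illustrate the pattern of pinned buffers plus one stream per chunk. Whether copies and kernels really overlap depends on the GPU.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: inverts each byte of a chunk.
__global__ void invert(const unsigned char* in, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];
}

// Processes n bytes in kStreams chunks, one stream per chunk.
// Returns the number of wrong output bytes (0 means success).
int run_streamed(int n)
{
    const int kStreams = 4;                            // assumption: four chunks
    const int chunk = (n + kStreams - 1) / kStreams;   // chunk size, rounded up

    unsigned char *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost((void**)&h_in, n);                  // pinned (page-locked) host memory
    cudaMallocHost((void**)&h_out, n);
    cudaMalloc((void**)&d_in, n);
    cudaMalloc((void**)&d_out, n);
    memset(h_in, 10, n);

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    // Queue copy-in, kernel, and copy-out for each chunk on its own stream,
    // so work on different chunks is allowed to overlap.
    for (int s = 0; s < kStreams; ++s) {
        const int off = s * chunk;
        const int len = (n - off < chunk) ? (n - off) : chunk;
        if (len <= 0) break;
        cudaMemcpyAsync(d_in + off, h_in + off, len, cudaMemcpyHostToDevice, streams[s]);
        invert<<<(len + 255) / 256, 256, 0, streams[s]>>>(d_in + off, d_out + off, len);
        cudaMemcpyAsync(h_out + off, d_out + off, len, cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < kStreams; ++s) cudaStreamSynchronize(streams[s]);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 245) ++bad;                    // 255 - 10

    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_in);
    cudaFree(d_out);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return bad;
}
```

Note that the asynchronous copies need pinned host buffers to be truly asynchronous, which is why both suggestions go together.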

By the way: is your kernel code optimized for execution on the GPU (especially concerning coalesced memory access)? 0.12 ms versus 0.18 ms is a very weak speedup.

Regards

Navier

r.odobenus · April 23, 2009, 10:40am

Hi Navier,

Thank you for your answer. I will try your suggestions (using pinned memory and streams). I am certain that my kernel is not optimized; it is as basic as it can get, almost straight from the programming guide. Here it is:

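__global__ void kernel( unsigned char* src_data, unsigned char* dst_data, int srcStep, int dstStep )
{
    // One thread per output pixel.
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Offset of this pixel's first channel in the packed 3-byte-per-pixel source row.
    int o = srcStep*y + 3*x;

    // Average the three channels (integer division).
    unsigned char val = (src_data[o] + src_data[o+1] + src_data[o+2]) / 3;

    // val is already an integer, so rintf does not change it here.
    dst_data[dstStep*y + x] = rintf(val);
}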

If this can be improved, I’d certainly like to learn how.

Regards,

Rob

navier-stokes · April 23, 2009, 11:30am

Have a look at the CUDA Training for general CUDA issues.

Apart from coalescing, your code is quite good. What happens if you use larger amounts of data? What are your grid and block dimensions?

r.odobenus · April 23, 2009, 12:55pm

Hi Navier,

Wow. That worked! Thanks. I used pinned memory and streams and now the GPU version runs at 0.028 ms.

Both the block size and the grid size are 16x16, but for the real application this could vary, as the filter should work for images of arbitrary size. I had a quick look at the training, but this coalescing still seems rather fuzzy to me (I am afraid it was a long time ago that I last had to think about memory layout…).

Regards,

Rob
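For readers who find coalescing equally fuzzy: in the kernel posted earlier in the thread, neighboring threads read source addresses that are 3 bytes apart, and each thread issues three separate one-byte reads. The hardware combines memory accesses best when neighboring threads touch neighboring addresses. One possible approach, sketched below under stated assumptions, is to let the threads of a block cooperatively load a row segment into shared memory so that consecutive threads read consecutive bytes. The names `rgb2gray_staged`, `gray_mismatches`, and `TILE_W` are hypothetical, and whether this is actually faster depends on the GPU generation, so it should be measured. The launch code also rounds the grid size up so that images of arbitrary width are covered.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE_W 64   // pixels handled per block (an arbitrary choice)

__global__ void rgb2gray_staged(const unsigned char* src, unsigned char* dst,
                                int width, int srcStep, int dstStep)
{
    __shared__ unsigned char tile[3 * TILE_W];

    const int y  = blockIdx.y;                     // one image row per block row
    const int x0 = blockIdx.x * TILE_W;            // first pixel of this tile
    const int pixels = min(TILE_W, width - x0);    // the last tile may be partial

    // Cooperative load: consecutive threads read consecutive source bytes.
    const unsigned char* row = src + y * srcStep + 3 * x0;
    for (int i = threadIdx.x; i < 3 * pixels; i += blockDim.x)
        tile[i] = row[i];
    __syncthreads();

    // Each thread averages one pixel from shared memory.
    if (threadIdx.x < pixels) {
        const int b = 3 * threadIdx.x;
        dst[y * dstStep + x0 + threadIdx.x] =
            (unsigned char)((tile[b] + tile[b + 1] + tile[b + 2]) / 3);
    }
}

// Converts a tightly packed width x height RGB test image and compares the
// result with a CPU reference. Returns the number of wrong pixels.
int gray_mismatches(int width, int height)
{
    const int srcStep = 3 * width, dstStep = width;
    unsigned char* h_src = (unsigned char*)malloc(srcStep * height);
    unsigned char* h_dst = (unsigned char*)malloc(dstStep * height);
    for (int i = 0; i < srcStep * height; ++i) h_src[i] = (unsigned char)(i % 251);

    unsigned char *d_src, *d_dst;
    cudaMalloc((void**)&d_src, srcStep * height);
    cudaMalloc((void**)&d_dst, dstStep * height);
    cudaMemcpy(d_src, h_src, srcStep * height, cudaMemcpyHostToDevice);

    dim3 block(TILE_W);
    dim3 grid((width + TILE_W - 1) / TILE_W, height);   // round up to cover any width
    rgb2gray_staged<<<grid, block>>>(d_src, d_dst, width, srcStep, dstStep);
    cudaMemcpy(h_dst, d_dst, dstStep * height, cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            const int o = y * srcStep + 3 * x;
            const int expected = (h_src[o] + h_src[o + 1] + h_src[o + 2]) / 3;
            if (h_dst[y * dstStep + x] != expected) ++bad;
        }

    cudaFree(d_src);
    cudaFree(d_dst);
    free(h_src);
    free(h_dst);
    return bad;
}
```

A call such as `gray_mismatches(101, 37)` exercises the partial last tile, which is the case that a fixed 16x16 grid would not handle.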
