Data transfer between CPU and GPU

Hi there, I have two questions:

First question: I need to transfer data from GPU to CPU and from CPU to GPU. To compute the transfer rate I’m timing the transfers with OpenCL events. It looks like the transfer from GPU to CPU is faster than the transfer from CPU to GPU (12.2 GB/s vs 11 GB/s). I read somewhere that this behavior is normal, but I don’t know why: is it because of restrictions imposed by PCIe or by the GPU? Any explanation and links would be useful. BTW: I’m using an NVIDIA C2070 GPU in a PCIe x16 2nd Generation slot, and the buffer on the host is pinned memory.
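As a reference point for the measurement described above: OpenCL event profiling (clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END, on a queue created with CL_QUEUE_PROFILING_ENABLE) reports timestamps in nanoseconds. A minimal sketch of the conversion to GB/s follows. The helper name transfer_rate_gb_per_s is hypothetical, and 1 GB is assumed to mean 1e9 bytes; a tool that uses 2^30 bytes would report slightly lower numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of OpenCL): turn a transfer size and two
 * profiling timestamps, both in nanoseconds, into a rate in GB/s.
 * Assumes 1 GB = 1e9 bytes, so bytes per nanosecond equals GB/s. */
double transfer_rate_gb_per_s(uint64_t bytes, uint64_t start_ns, uint64_t end_ns)
{
    /* The end timestamp must come after the start timestamp. */
    assert(end_ns > start_ns);
    return (double)bytes / (double)(end_ns - start_ns);
}
```

It is worth checking that the two timestamps really bracket the whole copy. Timing from submission rather than from command start, or dividing by the wrong buffer size, will skew the reported rate.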

Second question: what I actually need is to transfer data from GPU1 to GPU2. At the moment I do this with two transfers, GPU-CPU and then CPU-GPU, using pinned memory. Is there any way to transfer GPU-GPU directly? Both GPUs are C2070s.


Max bandwidth for your PCIe x16 Gen2 Tesla C2070 is 8 GB/s each way, so both reported bandwidths look suspect to me. I seem to recall running into reporting issues in the NVIDIA oclBandwidthTest application last year. Perhaps there is still a bug there, if that is what you are using.

The actual practical limit is around 5.5 GB/s with pinned memory. The 8 GB/s figure is theoretical and doesn’t take communication protocol overhead into account.
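The 8 GB/s figure can be reproduced from the link parameters: PCIe Gen2 signals at 5 GT/s per lane and uses 8b/10b encoding, so only 8 of every 10 bits on the wire carry data. A small sketch of that arithmetic, which also flags a measured rate that exceeds the peak. The function names are hypothetical, and 1 GB is assumed to be 1e9 bytes.

```c
#include <assert.h>

/* Theoretical one-direction PCIe bandwidth in GB/s (1 GB = 1e9 bytes).
 * payload_bits / encoded_bits is the line-code efficiency, 8/10 for Gen2. */
double pcie_peak_gb_per_s(double gigatransfers_per_s, int lanes,
                          int payload_bits, int encoded_bits)
{
    assert(lanes > 0 && payload_bits > 0 && encoded_bits > 0);
    double bits_per_s =
        gigatransfers_per_s * 1e9 * lanes * payload_bits / encoded_bits;
    return bits_per_s / 8.0 / 1e9;   /* bits -> bytes -> GB */
}

/* A measured rate above the theoretical peak points to a timing or
 * reporting error rather than a fast link. */
int is_plausible_rate(double measured_gb_per_s, double peak_gb_per_s)
{
    return measured_gb_per_s <= peak_gb_per_s;
}
```

With these inputs, pcie_peak_gb_per_s(5.0, 16, 8, 10) gives 8 GB/s, so the 12.2 GB/s and 11 GB/s measurements fail the plausibility check, while the 5.5 GB/s practical figure passes it.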

As for GPU-GPU transfers: you can do a direct copy between GPUs, bypassing CPU memory, if both GPUs are on the same PCI-E bus, the OS is 64-bit, the application is compiled for 64 bits, and, under Windows 7, the GPUs are in TCC mode (not needed under Linux / XP). Check that the device reports support for Unified Addressing (UVA). This is with CUDA 4.

Thanks, but how do I implement it in OpenCL? I was having issues.

Having an issue with what?

Respected Sir,

Recently I have been facing a problem in OpenCL that I am not able to find a solution for at the moment. I explain the situation with an example below.

int previous_pixel;

if (input_buffer[fs] == some value)

else if (previous_pixel != 0)

// operation being done

calculate some value “h” here, then,


This is the problem I am facing: how can I solve this dependency on previous_pixel? It is taking the value zero for all threads.

Thanks in advance

Best regards


I’m not fully following the code, but previous_pixel doesn’t seem to be set anywhere. If it is never initialized, not even to zero, it takes on some junk value that might very well be zero.

I got the solution for that. I am running one row per thread, because the dependency exists only within a row and the rows are independent of each other. The solution is shown below.

Initially I was doing it like this:








Global_worksize=(height) // taking only height number of threads, one per row







}//end of for loop
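
The kernel itself did not survive in the post, so here is only a rough sketch of the one-row-per-thread idea, written as plain C that can be run on the host. The update rule (a running count of consecutive non-zero pixels) is a placeholder and not the poster's actual operation. The names process_row and process_image are hypothetical. In a real OpenCL kernel the row index would come from get_global_id(0), with a global work size equal to the image height.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by one "work-item": walk a single row left to right.
 * previous_pixel is private to the row and explicitly initialised, so it
 * never holds a junk value and never depends on another row. */
void process_row(const int *input, int *output, size_t width, size_t row)
{
    assert(input != NULL && output != NULL);
    int previous_pixel = 0;
    for (size_t col = 0; col < width; ++col) {
        size_t fs = row * width + col;   /* flat index, as in the post */
        int h;
        if (input[fs] == 0) {
            h = 0;                       /* placeholder rule: reset on zero */
        } else if (previous_pixel != 0) {
            h = previous_pixel + 1;      /* placeholder rule: extend the run */
        } else {
            h = 1;                       /* placeholder rule: start a run */
        }
        output[fs] = h;
        previous_pixel = h;              /* carried to the next column */
    }
}

/* Host-side stand-in for launching `height` work-items, one per row. */
void process_image(const int *input, int *output, size_t width, size_t height)
{
    for (size_t row = 0; row < height; ++row) {
        process_row(input, output, width, row);
    }
}
```

The trade-off is that parallelism is limited to the number of rows, and each work-item does width sequential steps. That is acceptable when the dependency really is confined to a row, as described above.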