Concurrent Kernel and data transfer on multi-GPU systems

I am executing two kernels concurrently on two GPUs. To make sure that they are actually running concurrently, I’m using OpenCL events to time the kernel execution and the wall clock (including barriers) to time the total execution. The result is generally reasonable, say: TotalTime = KernelTime + BarrierTime (where TotalTime is timed using the wall clock and KernelTime is timed using OpenCL events).

Now, if I transfer data from GPU to GPU (using clEnqueueCopyBuffer), the total execution time measured with the wall clock should be close to: TotalTime = KernelTime + CopyTime + BarrierTime (where KernelTime and CopyTime are timed using OpenCL events). But it is not; in fact, it is much greater than the sum of the times measured using OpenCL events.

At this point I’m kind of clueless about this behaviour. Has anybody ever used GPU-to-GPU transfers? I’m guessing that is what is causing the problem. Any ideas are welcome.


If you are trying to copy directly from GPU to GPU, AFAIK that is not defined in the OpenCL spec (it shouldn’t work directly). To make it happen, the driver has to allocate temporary memory on the host, copy the data from the first GPU to the host, and then copy it from the host to the second GPU. It sounds like the event doesn’t fire properly for some reason, though; I’m guessing that it fires once the data is off the first GPU and not when it arrives on the second one.

You could run the following test: call clFinish() (to synchronize), start a performance counter (for high-precision timing), do the memory copy between GPUs, call clFinish() again, and stop the counter. This should give you a feel for how long the copy actually takes, so you can see whether the event timing is anywhere near accurate.

By the way, CUDA has the ability to do this, but only under a 64-bit OS, in TCC mode, and with some setup. If you meet these requirements and really need GPU-to-GPU transfers, you should probably use CUDA (things involving complex host <-> device memory transfers are a bit limited with OpenCL at the moment).

I checked the results of the transfer, and they are correct. I’m guessing that the GPU-GPU transfer is in fact translated into a GPU-CPU-GPU transfer; however, the event time is totally misleading. Anyway, I’m going to try what you suggested. However, I’m not really sure what you mean by “performance counter”; could you please clarify a bit more and perhaps post some pseudo-code?


I’m assuming here that you are using Windows.

What I mean is that it’s not simple to do high-precision timing. Regular clock functions don’t have the resolution and can drift, which can cause weird things such as negative times.

What you can do:

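#include <windows.h>
#include <iostream>

LARGE_INTEGER tfreq;
if (!QueryPerformanceFrequency(&tfreq))
	std::cerr << "No support for performance counters\n";
double freq = double(tfreq.QuadPart)/1000.0;  // counter ticks per millisecond

std::cerr << "Clock frequency: " << freq << " ticks/ms" << std::endl;

LARGE_INTEGER start, stop;
QueryPerformanceCounter(&start);  // read the counter before the timed section

// Your code here

QueryPerformanceCounter(&stop);   // read the counter after the timed section
std::cout << "Performance counters delta: " << double(stop.QuadPart - start.QuadPart)/freq << " ms" << std::endl;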

I’m using Linux. I guess QueryPerformanceXXX is valid for Windows only (Visual Studio)? Or am I wrong?


Yes, it is Windows only; I need to recall how to do the same under Linux. Another option is to use the less accurate timing but time multiple iterations to minimize the error.

You mean the wall clock?
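For the Linux case raised above, one possible counterpart to the Windows performance counter is the POSIX call clock_gettime with CLOCK_MONOTONIC. The sketch below combines it with the "time multiple iterations" idea. `nowMs` and `averageMs` are hypothetical helper names, not part of any library, and this is an assumption about what would fit here rather than something the thread settled on.

```cpp
#include <time.h>   // clock_gettime, CLOCK_MONOTONIC (POSIX)

// Current monotonic time in milliseconds. CLOCK_MONOTONIC does not jump
// backwards when the system clock is adjusted.
static double nowMs() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

// Hypothetical helper: runs `work` `iterations` times and returns the mean
// time per iteration, so timer resolution error is spread over many runs.
template <typename Work>
double averageMs(Work work, int iterations) {
    double start = nowMs();
    for (int i = 0; i < iterations; ++i)
        work();   // for OpenCL, put the clFinish() call inside `work`
    double stop = nowMs();
    return (stop - start) / iterations;
}
```

On older glibc versions, linking with -lrt may be needed for clock_gettime.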