Concurrent kernels and data transfer on multi-GPU systems

I am executing two kernels concurrently on two GPUs. To make sure that they are actually running concurrently, I'm using OpenCL events to time the kernel execution, and the wall clock (including barriers) to time the total execution time. In general the result is reasonable, roughly: TotalTime = KernelTime + BarrierTime (where TotalTime is measured with the wall clock and KernelTime with OpenCL events).

Now, if I transfer data from GPU to GPU (using clEnqueueCopyBuffer), the total execution time measured with the wall clock should be close to: TotalTime = KernelTime + CopyTime + BarrierTime (where KernelTime and CopyTime are timed using OpenCL events), but it is not; in fact it is much greater than the sum of the times measured using OpenCL events.
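For reference, this is roughly how I read the kernel and copy times from the events (a minimal sketch with placeholder queue/buffer/kernel names; the queues are created with CL_QUEUE_PROFILING_ENABLE):

#include <CL/cl.h>

cl_event kernel_evt, copy_evt;

// enqueue the kernel and the GPU-to-GPU copy, keeping an event for each
clEnqueueNDRangeKernel(queue_gpu0, kernel, 1, NULL, &global_size, NULL, 0, NULL, &kernel_evt);
clEnqueueCopyBuffer(queue_gpu0, buf_gpu0, buf_gpu1, 0, 0, bytes, 1, &kernel_evt, &copy_evt);
clFinish(queue_gpu0);

// profiling values are in nanoseconds
cl_ulong t0, t1;
clGetEventProfilingInfo(kernel_evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(kernel_evt, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
double kernel_ms = (t1 - t0) * 1.0e-6;

clGetEventProfilingInfo(copy_evt, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(copy_evt, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
double copy_ms = (t1 - t0) * 1.0e-6;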

At this point I'm kind of clueless about this behaviour. Has anybody ever used GPU-to-GPU transfers? Because I'm guessing that that is what is giving me issues. Any ideas are welcome.

Thanks,

If you are trying to copy directly from GPU to GPU, AFAIK that's not defined in the OpenCL spec (it shouldn't work), and to make it happen the driver has to allocate temporary memory on the host, copy the data from the first GPU to the host, and then copy it from the host to the second GPU. It sounds like for some reason the event doesn't fire properly; I'm guessing it fires once the data is off the first GPU and not when it arrives on the second one.
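One way to check is to do the staging yourself and time each leg separately, i.e. read the buffer back to the host from the first GPU and write it to the second one, with one event per transfer (a sketch with placeholder queue/buffer names, assuming both buffers live in the same context):

#include <stdlib.h>
#include <CL/cl.h>

// explicit GPU -> host -> GPU staging, so each half of the transfer gets its own event
void *staging = malloc(bytes);

cl_event read_evt, write_evt;
clEnqueueReadBuffer(queue_gpu0, buf_gpu0, CL_TRUE, 0, bytes, staging, 0, NULL, &read_evt);
clEnqueueWriteBuffer(queue_gpu1, buf_gpu1, CL_TRUE, 0, bytes, staging, 0, NULL, &write_evt);

// read_evt and write_evt can now be queried with clGetEventProfilingInfo
// to see how long each half of the transfer really takes
free(staging);

If the two legs together add up to roughly your missing wall-clock time, that would confirm the staging theory.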

You could run the following test: call clFinish() (to synchronize), start a performance counter (for high-precision timing), do the memory copy between the GPUs, call clFinish() again, and stop the counter. This should give you a feeling for how long the copy actually takes, and whether the event timing is anywhere near accurate.

By the way, CUDA has the ability to do this, but only under a 64-bit OS, TCC mode and some setup. If you meet these requirements and really need GPU-to-GPU transfers, you should probably use CUDA (things involving complex host <-> device memory transfers are a bit limited in OpenCL at the moment).

I checked the results of the transfer, and they are correct. I'm guessing that the GPU-GPU transfer is in fact translated into a GPU-CPU-GPU copy, but the event time is totally misleading. Anyway, I'm going to try what you suggested; however, I'm not really sure what you mean by "performance counter". Could you please clarify a bit more and perhaps post some pseudo-code?

Thanks.

I'm assuming here that you are using Windows.

What I mean is that it's not simple to do high-precision timing. Regular clock functions don't have the resolution and have some drift that can cause weird things such as negative times.

What you can do:

#include <windows.h>
#include <iostream>

LARGE_INTEGER tfreq;
if (!QueryPerformanceFrequency(&tfreq))
	std::cerr << "No support for performance counters\n";
double freq = double(tfreq.QuadPart) / 1000.0;   // counter ticks per millisecond

std::cerr << "Clock frequency: " << freq << std::endl;

LARGE_INTEGER start, stop;
QueryPerformanceCounter(&start);

// Your code here: clFinish(queue), the GPU-to-GPU copy, then clFinish(queue) again

QueryPerformanceCounter(&stop);
std::cout << "Performance counter delta: " << double(stop.QuadPart - start.QuadPart) / freq << " ms" << std::endl;

I'm using Linux. I guess QueryPerformanceXXX is valid for Windows only (Visual Studio)? Or am I wrong?

Thanks.

Yes, I need to recall how to do the same under Linux. Another option is to use the less accurate timing but time multiple iterations to minimize the error.
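If I remember correctly, clock_gettime with CLOCK_MONOTONIC is the usual equivalent on Linux, roughly mirroring the Windows version above (untested sketch; older glibc needs -lrt for clock_gettime):

#include <time.h>
#include <iostream>

timespec start, stop;
clock_gettime(CLOCK_MONOTONIC, &start);

// Your code here: clFinish(queue), the GPU-to-GPU copy, then clFinish(queue) again

clock_gettime(CLOCK_MONOTONIC, &stop);
double ms = (stop.tv_sec - start.tv_sec) * 1000.0
          + (stop.tv_nsec - start.tv_nsec) / 1.0e6;
std::cout << "Elapsed: " << ms << " ms" << std::endl;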

The wall clock, you mean?