Run same kernel on two devices simultaneously

Hi there.

I’m trying to evaluate the throughput of running the same kernel simultaneously on two different devices, compared with the throughput I have already measured for the kernel on a single device.

On a single device it’s something like

(sort of pseudo-code here)
CUT_DEVICE_INIT(device0);
kernel_function<<<grid, threads>>>(arguments);
//this is the warmup so no timer starting here

cutStartTimer(timer);

for(whichever #iterations)
kernel_function<<<grid, threads>>>(arguments);

cutStopTimer(timer);

//evaluate throughput

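One thing to watch out for in the loop above: kernel launches are asynchronous, so stopping a host-side timer right after the loop mostly measures launch overhead. A sketch of the same measurement using CUDA events, which are recorded on the device itself (`iterations` stands in for "whichever #iterations", and `kernel_function`, `grid`, `threads`, `arguments` are the placeholders from the snippet above, so this won't compile verbatim):

```
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

kernel_function<<<grid, threads>>>(arguments);   // warmup, not timed

cudaEventRecord(start, 0);
for (int i = 0; i < iterations; ++i)
    kernel_function<<<grid, threads>>>(arguments);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // block until the last kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
// throughput = work_done / (ms / 1000.0f)

cudaEventDestroy(start);
cudaEventDestroy(stop);
```
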
The problem is CUT_DEVICE_INIT.

I can make the kernel run on a different device, but only after it has run on the first one.

If I do

CUT_DEVICE_INIT(device0);
CUT_DEVICE_INIT(device1);

the kernel will only run on the last device initialized, and the following code won’t help me either:

CUT_DEVICE_INIT(device0);
kernel_function<<<grid, threads>>>(arguments);
//this is the warmup so no timer starting here

cutStartTimer(timer);

for(whichever #iterations)
kernel_function<<<grid, threads>>>(arguments);

CUT_DEVICE_INIT(device1);
kernel_function<<<grid, threads>>>(arguments);
//this warmup falls inside the timed region

for(whichever #iterations)
kernel_function<<<grid, threads>>>(arguments);
cutStopTimer(timer);

In this case the kernels run on the two devices in series, not simultaneously. My aim was to evaluate how shared memory and its accesses influence the total throughput, and by what factor it changes.

I’m thinking of a C-style solution: adding a fork() to my main function, with the parent process driving device0 and the child process driving device1. But again the kernels won’t run simultaneously, and again I won’t be measuring the throughput the way I wanted.

Any help, please? How can I do this the CUDA way?

You should do two things.

  1. Look at cutil.h, and in particular at the definition of CUT_DEVICE_INIT. It should become obvious why what you are trying to do can’t work.
  2. Look at the SDK example simpleMultiGPU, which contains a working example of how to manage two contexts from a single host thread using the runtime API.
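
The structure simpleMultiGPU uses boils down to one host thread per device: each thread binds its own context with cudaSetDevice() and then launches independently, so the two GPUs really do run at the same time. A rough sketch assuming POSIX threads (`kernel_function`, `grid`, `threads`, `arguments`, `iterations` are the placeholders from the question, so this is a skeleton, not compilable code):

```
#include <pthread.h>
#include <cuda_runtime.h>

// Worker executed once per GPU. cudaSetDevice() must be the first CUDA
// call in the thread so that the thread's context is created on the
// right device.
void *device_thread(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);

    kernel_function<<<grid, threads>>>(arguments);   // warmup, not timed

    for (int i = 0; i < iterations; ++i)
        kernel_function<<<grid, threads>>>(arguments);
    cudaThreadSynchronize();   // wait for this device to finish

    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = {0, 1};

    // Start both workers, then wait; the kernels overlap in time.
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, device_thread, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```

Start one cutil timer on the host before creating the threads and stop it after both joins to get the combined wall-clock time.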

+1 avidday

Anyway, you will just measure either the slowest device’s transfer + execution time, or the sum of the transfer time and the slowest device’s execution time. I don’t see the point.

If you want to load-balance your kernel between the devices, you have to benchmark them separately and then allocate work to each device accordingly. Or, alternatively, dynamically send new kernels to each device as earlier kernels finish.