Hi there.
I’m trying to measure the throughput of running the same kernel simultaneously on two different devices, and compare it against the throughput I have already measured for that kernel on a single device.
On a single device it’s something like this (pseudo-code, more or less):
CUT_DEVICE_INIT(device0);
kernel_function<<<grid, threads>>>(arguments);
// this is the warmup, so no timer is started here
cutStartTimer(timer);
for (/* whichever #iterations */)
    kernel_function<<<grid, threads>>>(arguments);
cudaThreadSynchronize(); // launches are asynchronous; wait for them before stopping the timer
cutStopTimer(timer);
// evaluate throughput
The problem is CUT_DEVICE_INIT.
I can make that kernel run on a different device, but only after it has run on the first device.
If I do
CUT_DEVICE_INIT(device0);
CUT_DEVICE_INIT(device1);
I will only be running the kernel on the last device set, and the following code won’t help me either:
CUT_DEVICE_INIT(device0);
kernel_function<<<grid, threads>>>(arguments);
// this is the warmup on device0, so no timer started here
cutStartTimer(timer);
for (/* whichever #iterations */)
    kernel_function<<<grid, threads>>>(arguments);
CUT_DEVICE_INIT(device1);
kernel_function<<<grid, threads>>>(arguments);
// this is the warmup on device1, but the timer is still running here
for (/* whichever #iterations */)
    kernel_function<<<grid, threads>>>(arguments);
cudaThreadSynchronize(); // again, wait for the launches before stopping the timer
cutStopTimer(timer);
In this case the kernels run on the two devices in series, not simultaneously. What I wanted to evaluate is how shared memory and its accesses influence the total throughput, and by what factor it changes.
I’m thinking of a C-wise solution: adding a fork() to my main function and making the parent process drive device0 and the child process drive device1. But again the kernels won’t run simultaneously, so once more I wouldn’t be evaluating the resulting throughput the way I wanted.
Any help, please? How can I do this the CUDA way?