I am developing a multi-GPU application with more than one stream per GPU.
What is the best way to measure performance in this setup?
I was thinking of ending with a final loop that synchronizes every device:
for (int i = 0; i < nGPUs; i++)
{
    cudaSetDevice(i);
    cudaDeviceSynchronize();
}
and then stop the timer.
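
To make this concrete, here is a minimal sketch of the timing I have in mind. The host-side `std::chrono` timer, the extra synchronization before starting the timer, and the `launchWork` placeholder (a hypothetical stand-in for my real kernels and copies on each stream) are assumptions for illustration, not my actual code.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Block the host until all streams on all devices have finished.
static cudaError_t syncAllDevices(int nGPUs) {
    for (int i = 0; i < nGPUs; i++) {
        cudaError_t err = cudaSetDevice(i);
        if (err != cudaSuccess) return err;
        err = cudaDeviceSynchronize();  // waits for every stream on device i
        if (err != cudaSuccess) return err;
    }
    return cudaSuccess;
}

// Hypothetical placeholder: the real application would enqueue its
// kernels and memory copies on each device's streams here.
static void launchWork(int nGPUs) { (void)nGPUs; }

// Returns the wall-clock time in milliseconds, or -1.0 on a CUDA error.
static double timeRun(int nGPUs) {
    // Assumption: drain any earlier work so it is not counted.
    if (syncAllDevices(nGPUs) != cudaSuccess) return -1.0;
    auto start = std::chrono::steady_clock::now();
    launchWork(nGPUs);
    // The final loop from above, then stop the timer.
    if (syncAllDevices(nGPUs) != cudaSuccess) return -1.0;
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    int nGPUs = 0;
    if (cudaGetDeviceCount(&nGPUs) != cudaSuccess) nGPUs = 0;
    double ms = timeRun(nGPUs);
    std::printf("%d GPU(s): %.3f ms\n", nGPUs, ms);
    return 0;
}
```

This measures the end-to-end wall-clock time seen by the host, so it includes launch overhead as well as the GPU work itself.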
Is that the best way?
Thank you in advance.