What is the best way to measure the performance of multi-GPU code?

I am developing a multi-GPU application with more than one stream per GPU.

What is the best way to measure performance in this setup?

What I was thinking is to have a final loop that synchronizes every device:

    for (int i = 0; i < nGPUs; i++)
    {
        cudaSetDevice(i);
        cudaDeviceSynchronize();
    }

and then stop the timer.
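To make the idea concrete, here is a minimal sketch of the whole timing pattern as I picture it. It is not my real code: the kernel `dummyKernel`, the stream count `streamsPerGPU`, the buffer size `n`, and the use of a host-side `std::chrono` timer are all assumptions for illustration, and error checking is left out for brevity.

```cuda
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; it stands in for the real per-stream work.
__global__ void dummyKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = data[idx] * 2.0f + 1.0f;
}

int main()
{
    int nGPUs = 0;
    if (cudaGetDeviceCount(&nGPUs) != cudaSuccess || nGPUs == 0) {
        std::printf("No CUDA devices found\n");
        return 1;
    }

    const int streamsPerGPU = 2;  // assumption: any value > 1 fits the setup
    const int n = 1 << 20;        // assumption: elements per stream buffer
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    std::vector<cudaStream_t> streams(nGPUs * streamsPerGPU);
    std::vector<float *> buffers(nGPUs * streamsPerGPU, nullptr);

    // Setup, kept outside the timed region.
    for (int i = 0; i < nGPUs; i++) {
        cudaSetDevice(i);
        for (int s = 0; s < streamsPerGPU; s++) {
            int k = i * streamsPerGPU + s;
            cudaStreamCreate(&streams[k]);
            cudaMalloc(&buffers[k], n * sizeof(float));
            cudaMemsetAsync(buffers[k], 0, n * sizeof(float), streams[k]);
        }
        cudaDeviceSynchronize();  // make sure setup has finished before timing
    }

    auto start = std::chrono::steady_clock::now();

    // Launch work on every stream of every GPU; launches are asynchronous.
    for (int i = 0; i < nGPUs; i++) {
        cudaSetDevice(i);
        for (int s = 0; s < streamsPerGPU; s++) {
            int k = i * streamsPerGPU + s;
            dummyKernel<<<blocks, threads, 0, streams[k]>>>(buffers[k], n);
        }
    }

    // Final loop: wait for all streams on all devices.
    for (int i = 0; i < nGPUs; i++) {
        cudaSetDevice(i);
        cudaDeviceSynchronize();
    }

    auto stop = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    std::printf("Elapsed wall-clock time across %d GPU(s): %.3f ms\n", nGPUs, ms);

    // Cleanup.
    for (int i = 0; i < nGPUs; i++) {
        cudaSetDevice(i);
        for (int s = 0; s < streamsPerGPU; s++) {
            int k = i * streamsPerGPU + s;
            cudaFree(buffers[k]);
            cudaStreamDestroy(streams[k]);
        }
    }
    return 0;
}
```

With this approach the measured time is host wall-clock time, so it includes kernel launch overhead as well as the GPU work itself.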

Is that the best way?

Thank you in advance