What is the best way to measure the performance of multi-GPU code?

I am developing a multi-GPU application with more than one stream per GPU.

What is the best way to measure performance in this setup?

My idea is to have a final loop over all GPUs

for (int i = 0; i < nGPUs; i++)





and then stop the timer.

Is that the best way?
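
To make the idea concrete, here is a minimal sketch of the kind of timing I have in mind. It assumes the loop body selects each device with cudaSetDevice and waits on it with cudaDeviceSynchronize, which blocks until all streams on that device have finished. The helper names syncAllGPUs and timeMultiGPU are hypothetical, and launchWork stands in for whatever issues the asynchronous work on every GPU and stream.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical helper: wait until every stream on every GPU is idle.
// cudaDeviceSynchronize() waits for all streams of the current device.
static bool syncAllGPUs(int nGPUs) {
    for (int i = 0; i < nGPUs; i++) {
        if (cudaSetDevice(i) != cudaSuccess) return false;
        if (cudaDeviceSynchronize() != cudaSuccess) return false;
    }
    return true;
}

// Hypothetical wrapper: wall-clock time in milliseconds for launchWork,
// a callable that enqueues the asynchronous work on all GPUs and streams.
template <typename F>
double timeMultiGPU(int nGPUs, F launchWork) {
    syncAllGPUs(nGPUs);  // drain earlier work so it is not counted
    auto t0 = std::chrono::steady_clock::now();

    launchWork();        // kernels and copies are only enqueued here
    syncAllGPUs(nGPUs);  // the "final loop": wait for every device

    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

It would be called as something like timeMultiGPU(nGPUs, [&] { /* launch kernels on each GPU and stream */ }). Since this uses a host timer, the result includes launch overhead and is the end-to-end time across all GPUs, not a per-stream figure.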

