Measuring execution time across multiple GPUs on CUDA 4.0 with cudaEvents

Yello!

Take this as an example, if you may:

for(int i=0;i<GPU_N;i++){
    //gain control of device i
    cudaSetDevice(i);

    //asynchronously copy data to device (destination first, then source)
    cudaMemcpyAsync(d_data[i],h_data[i],N*sizeof(/*var type*/),cudaMemcpyHostToDevice,stream[i]);

    //asynchronously launch kernel
    some_kernel<<<blocks,threads,/*some shared memory amount*/,stream[i]>>>(d_data[i], /*more arguments as fit*/);

    //asynchronously copy data from device
    cudaMemcpyAsync(h_data[i],d_data[i],N*sizeof(/*var type*/),cudaMemcpyDeviceToHost,stream[i]);
}

//wait for devices to finish
for(int i=0;i<GPU_N;i++){
    //gain control of device i
    cudaSetDevice(i);
    cudaStreamSynchronize(stream[i]);
}

When I add some cudaEvents to it:

cudaEventRecord(start,0);  /*this goes just before the snippet above*/
cudaEventRecord(stop,0);   /*this goes just after the snippet above*/
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time,start,stop);
//elapsed_time is a float holding milliseconds, so it is printed with %f rather than %d
printf("Execution time : %f \n",elapsed_time);

The last line, the printf, outputs 0, i.e. elapsed_time is zero.

I don’t know if I’m right, but it seems the start event is being recorded on the first device and the stop event on the last device, so comparing the two makes no sense.

Either way it doesn’t work. So, how can I measure execution time using cudaEvents?

Will I have to rely on a CPU timer before the first cudaSetDevice and after the last cudaStreamSynchronize?
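For reference, the CPU-timer approach could look roughly like the sketch below. It assumes a toolchain with C++11 `std::chrono`; `time_region_ms` is a hypothetical helper name, not part of CUDA. The callable passed in would hold both loops from the snippet above and must end with the cudaStreamSynchronize loop, because the copies and kernel launches return immediately and the timer would otherwise stop before the devices finish.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in milliseconds, measured on the host.
template <typename Work>
double time_region_ms(Work&& work) {
    const auto t0 = std::chrono::steady_clock::now();
    // `work` must not return until every device is done,
    // e.g. it ends with the cudaStreamSynchronize loop.
    std::forward<Work>(work)();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

It would be called as `double ms = time_region_ms([&]{ /* both loops */ });` and the result printed with `%f`. A host timer also includes launch overhead on the CPU side, which may or may not be what you want to measure.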

Thank you for your time.

The programming guide says cudaEventElapsedTime() fails if the two cudaEvents passed as arguments belong to different devices, which explains why I get zero for the elapsed time.
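One consequence of that restriction is that each start/stop pair has to be created and recorded while its own device is current. A rough sketch of that, assuming one stream per device as in the snippet above; the kernel `scale`, the element count `N`, and the bound `MAX_GPUS` are placeholders invented for illustration, and error checking is left out for brevity:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_GPUS 8      // hypothetical upper bound on the number of devices
#define N (1 << 20)     // hypothetical element count per device

// Placeholder kernel standing in for some_kernel.
__global__ void scale(float *d, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) d[idx] *= 2.0f;
}

int main() {
    int gpu_n = 0;
    cudaGetDeviceCount(&gpu_n);
    if (gpu_n > MAX_GPUS) gpu_n = MAX_GPUS;

    cudaStream_t stream[MAX_GPUS];
    cudaEvent_t start[MAX_GPUS], stop[MAX_GPUS];
    float *h_data[MAX_GPUS], *d_data[MAX_GPUS];

    // Create the stream, events and buffers while device i is current.
    for (int i = 0; i < gpu_n; i++) {
        cudaSetDevice(i);
        cudaStreamCreate(&stream[i]);
        cudaEventCreate(&start[i]);
        cudaEventCreate(&stop[i]);
        // Pinned host memory, so the async copies can actually overlap.
        cudaMallocHost((void **)&h_data[i], N * sizeof(float));
        cudaMalloc((void **)&d_data[i], N * sizeof(float));
        for (int k = 0; k < N; k++) h_data[i][k] = 1.0f;
    }

    // Issue the work; each event pair is recorded in its device's stream.
    for (int i = 0; i < gpu_n; i++) {
        cudaSetDevice(i);
        cudaEventRecord(start[i], stream[i]);
        cudaMemcpyAsync(d_data[i], h_data[i], N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        scale<<<(N + 255) / 256, 256, 0, stream[i]>>>(d_data[i], N);
        cudaMemcpyAsync(h_data[i], d_data[i], N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);
        cudaEventRecord(stop[i], stream[i]);
    }

    // Wait for each device and read its own elapsed time (milliseconds).
    for (int i = 0; i < gpu_n; i++) {
        cudaSetDevice(i);
        cudaEventSynchronize(stop[i]);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start[i], stop[i]);
        printf("Device %d : %f ms\n", i, ms);

        cudaEventDestroy(start[i]);
        cudaEventDestroy(stop[i]);
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(h_data[i]);
        cudaFree(d_data[i]);
    }
    return 0;
}
```

This yields one time per device. It does not say how those intervals line up against each other, since each pair is measured on its own device.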

If I define separate start and stop cudaEvents for each device and add up the elapsed times, I’ll get the total execution time plus the overlap between devices. The overlap is unknown, so the sum will always be greater than the actual execution time. How can I use the devices’ counters to get an accurate execution time?
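To make the overlap problem concrete: when the devices run concurrently, adding the per-device times counts the overlapped span once per device, while the longest single per-device time is only a lower bound on the wall-clock time, since the devices need not start at the same instant. A small host-side sketch; the helper names `sum_ms` and `longest_ms` are made up for illustration:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Sum of per-device elapsed times: over-counts whenever devices overlap.
double sum_ms(const std::vector<float>& per_device_ms) {
    return std::accumulate(per_device_ms.begin(), per_device_ms.end(), 0.0);
}

// Longest per-device time: a lower bound on the total wall-clock time.
double longest_ms(const std::vector<float>& per_device_ms) {
    return per_device_ms.empty()
        ? 0.0
        : *std::max_element(per_device_ms.begin(), per_device_ms.end());
}
```

With hypothetical timings of 10 ms and 12 ms on two devices, `sum_ms` gives 22 ms and `longest_ms` gives 12 ms. The wall-clock time is at least 12 ms and would reach 22 ms only if the devices ran back to back with no overlap, so neither number alone is the execution time being asked about.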

Are CPU timers the only way to get the execution time I want?