Issue with measuring execution time after accelerating with CUDA

Hi,

After accelerating some serial code with CUDA, I’ve tried to measure the execution time difference between the serial code and the parallel code.

The original code is structured as follows.

// original version
{
    clock_t t0 = clock();
    // serial CPU code
    clock_t t1 = clock();
    // serial CPU code (OpenCV API calls)
    clock_t t2 = clock();
    // elapsed per region: (t1 - t0) and (t2 - t1), divided by CLOCKS_PER_SEC
}

I’ve changed the above code as follows. I’ve only modified the first part, leaving the OpenCV part unchanged.

// modified version
{
    clock_t t0 = clock();
    // CUDA code (*** changed ***)
    cudaDeviceSynchronize();
    clock_t t1 = clock();
    // serial CPU code (OpenCV API calls) (*** unchanged ***)
    clock_t t2 = clock();
}

Strangely, while I see a speedup in the changed region and in the overall execution time, the unchanged serial CPU (OpenCV) region now takes about 50 times longer. What could cause this behavior? Also, what is the best practice for measuring the performance gain when accelerating serial CPU code with CUDA?

Update:

Just for reference, in case someone else comes across this post: I solved the issue by measuring wall-clock time with the C++ chrono library. I was on Linux, where the clock() function measures processor (CPU) time rather than wall-clock time, so it should be avoided for this kind of measurement.