Timing CUDA code (sorry for the slight déjà-vu) :-)

Hello scientists :-),
I have a short question related to timing CUDA code. I’ve read a lot of topics about timing on this forum; two of them are here (just for reference):
Link1
Link2
But I was not able to find a definitive answer to the following question:
Is it better to use host-based timing with cudaThreadSynchronize(), or are CUDA events preferred?
My background:

  • single device - 8800GTX
  • operations issued one after another to stream 0; no concurrency, no multiple streams
  • my task is to compare CPU vs. GPU timings of some equivalent pieces of code
  • I’m timing single kernel executions as well as whole batches of successive kernel launches.

Which of the two mentioned timing methods would you advise, and why? I’m currently using the first one (cudaThreadSynchronize() plus host-based timing using clock_gettime(CLOCK_REALTIME, timer) on Linux machines). I would appreciate your advice or your own experience. Thank you.
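For reference, this is roughly the pattern I use right now (only a sketch; myKernel, the launch configuration, and the arguments are placeholders):

#include <stdio.h>
#include <time.h>

/* elapsed wall-clock time in seconds between two CLOCK_REALTIME samples */
static double elapsed_sec(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* ... later, around the kernel launch ... */
struct timespec t0, t1;
clock_gettime(CLOCK_REALTIME, &t0);
myKernel<<<grid, block>>>(d_data, n);
cudaThreadSynchronize();               /* wait until the kernel has actually finished */
clock_gettime(CLOCK_REALTIME, &t1);
printf("kernel time: %.3f ms\n", elapsed_sec(t0, t1) * 1000.0);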
Jan

Events if you want to measure GPU time.

Paulius

Well, but why do you prefer events, and what is the difference? Is it possible to say? Furthermore, the member with the nickname nwilt noted something like this in both of the mentioned threads:

But this sentence doesn’t clarify things to me :-(

Events precisely measure the actual time taken on the GPU for a kernel (or series of kernels/memcpys) to run. Because they use a high-precision timer, you can measure the time of just one kernel call. But, depending on what you are doing, the time measured by the events may not be an accurate measurement of what you want (more in a bit).

Wall clock time measurements are not as precise; you will need to time several hundred or thousand kernel calls in a row to build up a precise average time per call. But you are accurately measuring the actual time it takes to perform the task, including driver overhead and any CPU computations that may be part of your algorithm.
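For example, something along these lines (just a sketch; the kernel, its arguments, and the host timer function are placeholders):

const int NUM_CALLS = 1000;               /* enough launches to average out host timer granularity */
double t0 = host_timer();                 /* any high-resolution wall clock timer */
for (int i = 0; i < NUM_CALLS; i++) {
    kernel<<<grid, block>>>(d_in, d_out);
}
cudaThreadSynchronize();                  /* wait until all queued launches have completed */
double t1 = host_timer();
double avg_ms = (t1 - t0) / NUM_CALLS * 1000.0;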

So, the two have different uses. I have a bit of code that runs for a little while on the CPU (5% of the time) and then uses the calculated result from the CPU to launch a GPU kernel which runs for the other 95% of the time.

  1. I can use events to measure the time of execution of the kernel on the GPU. Uses: With a precise measurement of the actual kernel execution time, I can compute the effective GFLOP/s and GiB/s rates to see if I am pushing the hardware limits.

But, if what I want is to compare to a CPU code, the time I’m measuring using events is only 95% of the total wall clock time from start to finish!

So: 2) I can use the wall clock time to measure the total time from start to finish of the algorithm to compare to a CPU implementation of the same code.

Of course, you can always use the profiler tool to time kernel calls to give you the same information as events. But events are a little more flexible than the profiler in that they can be inserted in streams and time a series of multiple kernels/memcpys.
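For example, here is a rough sketch of bracketing a whole memcpy/kernel/memcpy sequence with events (kernel names, pointers, and sizes are placeholders):

cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                               /* recorded into stream 0 */
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
kernelA<<<grid, block>>>(d_in, d_tmp);
kernelB<<<grid, block>>>(d_tmp, d_out);
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);

cudaEventSynchronize(stop);                              /* wait until 'stop' has been reached */
cudaEventElapsedTime(&ms, start, stop);                  /* GPU time for the whole sequence, in ms */
cudaEventDestroy(start);
cudaEventDestroy(stop);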

Thank you MisterAnderson, this is a hell of a good explanation. And there are some very new pieces of information in it for me (I didn’t know that the CUDA profiler gives the same result as events, for instance). Thx once more…

Jan

I am also interested in measuring execution times, but my code runs on separate streams. For example, I could have call sequence A on one stream and call sequence B on another. From what I understand, CUDA events can only measure the time taken for both calls to complete, since for timing measurements the events need to be in stream 0. Is this correct?

Is there some clever way of placing an event on non-0 streams and triggering some timing measurement when each call sequence (A or B) completes?

Thank you for your help.

skb

The cudaEventRecord() function takes the CUDA stream as its second argument:

cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream);
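So, as a sketch (the stream, kernel, and arguments are placeholders), you could record a start/stop event pair into the stream that runs sequence A:

cudaStream_t streamA;
cudaEvent_t startA, stopA;
float msA = 0.0f;

cudaStreamCreate(&streamA);
cudaEventCreate(&startA);
cudaEventCreate(&stopA);

cudaEventRecord(startA, streamA);                          /* enqueued into streamA */
sequenceA_kernel<<<grid, block, 0, streamA>>>(d_dataA);
cudaEventRecord(stopA, streamA);

cudaEventSynchronize(stopA);                               /* wait for sequence A only */
cudaEventElapsedTime(&msA, startA, stopA);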

Hello Charley,

Even though cudaEventRecord() can take a stream argument, at least according to the documentation, cudaEventElapsedTime() returns an undefined value if either of the events was recorded in a non-0 stream.

Can you tell me what your proposal was for using events?

Thank you,

skb

Did you mean to address that to seb?

Yes, sorry, it was for seb. :">

Let’s try that again …

Hello seb,

Even though cudaEventRecord() can take a stream argument, at least according to the documentation, cudaEventElapsedTime() returns an undefined value if either of the events was recorded in a non-0 stream.

Can you tell me what your proposal was for using events?

Thank you,

skb

According to this work: http://www.scribd.com/doc/38986134/B-tech-Thesis
I used clock() to measure the performance of a simple CUDA vector multiplication. For 114440 elements I got a kernel execution time of 1017.11 ms with clock(), but only about 8 ms with CUDA events (for 1024 elements: clock() gave 98 ms and events gave 0.014 ms).
Example using events:

cudaEvent_t start, stop;
float time = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<(N / 256) + 1, 256>>>(inputa, inputb, outputc);
cudaThreadSynchronize();                   /* not strictly needed; cudaEventSynchronize(stop) below also waits */
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                /* block until the 'stop' event has been reached */
cudaEventElapsedTime(&time, start, stop);
avgKernel += time;

Does anyone have any idea why?

For host-based timing, clock() doesn’t provide the fine granularity needed to accurately time short-running CUDA kernels. On the platforms I know, clock() typically has a resolution of 1/100 of a second. You would want to use a high-precision timer on the host to be able to reconcile host-side and GPU-side time measurements. Below is some high-resolution timer code for host-side measurements that I have been using for years and which has served me well. Please note that due to driver overhead your host-side measurements will typically show higher runtimes than GPU-side timing with cudaEvents [the latter just measures the actual kernel runtime without any overheads whatsoever].

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
/* current time in seconds, using the high-resolution performance
   counter if available, GetTickCount() otherwise */
static double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() / 1000.0;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
/* current time in seconds, with microsecond resolution */
static double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}
#else
#error unsupported platform
#endif
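Typical use around a kernel launch would look something like this (just a sketch; the kernel and launch configuration are placeholders):

double start, stop;

start = second();
kernel<<<grid, block>>>(d_in, d_out);
cudaThreadSynchronize();              /* block until the kernel has finished */
stop = second();
printf("elapsed host-side time = %.6f seconds\n", stop - start);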

Thnx, I’m sure the problem is in the timer resolution and that this is why there is such a big difference in execution times; I’ll try with this.

I’m new to this, and also confused, because I have implemented CUDA events and OpenCL events to measure CPU-to-GPU copy, GPU-to-CPU copy, and kernel execution times. The thing that bugs me the most is that my OpenCL implementation shows better results than my CUDA implementation.

For example (using events as described in the CUDA and OpenCL documentation):
OpenCL:

cl_ulong start, end;
/* note: the command queue must have been created with CL_QUEUE_PROFILING_ENABLE,
   and the event must have completed (e.g. after clWaitForEvents) before querying */
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
float executionTimeInMilliseconds = (end - start) * 1.0e-6f;

CUDA:
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord( start, 0 );
kernel<<<grid,threads>>> ( d_odata, d_idata, size_x, size_y, NUM_REPS);
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
cudaEventElapsedTime( &time, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );

For 2048 elements:
CUDA: CPU-to-GPU 0.0165979 ms, GPU-to-CPU 0.091427 ms, kernel 0.007098 ms
OpenCL: CPU-to-GPU 0.007276 ms, GPU-to-CPU 0.006684 ms, kernel 0.011754 ms

I tried with bigger numbers of elements, like 114440, 2097152, etc., and OpenCL still shows better performance.
Literature and articles all say that CUDA offers better performance, so I’m thinking that I’m doing something wrong. What should I check?
I have already checked synchronization, calculated average values, and changed kernel execution settings…