equivalent to CUDA events CPU time functions

I am wondering what would be the equivalent of CUDA's start and stop event timing functions on the CPU side?
GetTickCount()? clock()?

Do CUDA event functions measure performance in milliseconds?
Please advise, and thanks.


I am not entirely sure what you are asking. If you are looking for a high-precision timer to time your host code, you may want to try the following code, which I have been using for well over a decade.

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
static double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() / 1000.0;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
static double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}
#else
#error unsupported platform
#endif

Hello and thanks for your time and help.
I want to time only individual functions in my Host code and then compare them with GPU equivalent functions.
I am trying to find a simple function that measures the elapsed time of my function in milliseconds.
The above code is too complicated for me!

If you want to time host function execution with a high-resolution timer, the code I posted should work well for you, because that is exactly what I wrote it for and what I use it for. You do not need to worry about the implementation details. The code looks a little obscure because it has conditional code branches for Linux, Windows, and Mac OS X, which are the OS platforms supported by CUDA. Simply include the snippet at the start of your code and then call second(), like so:

double start, stop, elapsed;
start = second();
[ .... code under test ...]
stop = second();
elapsed = stop - start; // execution time in seconds, with microsecond resolution

Make sure you warm up caches, etc., when you time host code. You would not want to time the first execution of a piece of code.
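For example, a minimal warm-up harness might look like the sketch below. work() is just a placeholder for whatever host function you want to measure, and for brevity it only uses the Linux/Mac branch of the timer:

```c
#include <stddef.h>
#include <sys/time.h>

/* Linux/OSX branch of the timer posted above. */
static double second(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}

/* Hypothetical stand-in for the host function under test. */
static double work(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += (double)i * 0.5;
    return sum;
}

/* Run once untimed to warm caches and page in code, then time
   the second run. */
static double timed_run(int n)
{
    volatile double sink;
    sink = work(n);            /* warm-up pass, not timed */
    double start = second();
    sink = work(n);            /* measured pass */
    double stop = second();
    (void)sink;
    return stop - start;
}
```

Calling timed_run() then gives you the elapsed time for a run that starts with warm caches.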

I will use your advice.

However, I used a different approach, and I want you to tell me if it makes sense.

double diffclock(clock_t clock1, clock_t clock2)
{
    double diffticks = clock1 - clock2;
    double diffms = (diffticks * 1000) / CLOCKS_PER_SEC;
    return diffms;
}

// and then:
clock_t begin = clock();

// function to measure goes here
clock_t end = clock();
cout << "Time elapsed: " << diffclock(end, begin) << " ms" << endl;
return 0;

//does it make sense?

As far as I recall, clock() provides a low-resolution clock, where the resolution is 1/60 or 1/100 of a second. This resolution is too low for accurately timing individual functions on modern CPUs, unless they happen to be very long-running functions.
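If you want to check the actual granularity of clock() on your own system, a quick probe like this (just a sketch) spins until the reported value changes:

```c
#include <time.h>

/* Spin until clock() advances and report the observed tick in
   milliseconds. This is the effective resolution of clock() on
   this system, which may be coarser than 1/CLOCKS_PER_SEC. */
static double clock_tick_ms(void)
{
    clock_t t0 = clock();
    clock_t t1;
    do {
        t1 = clock();
    } while (t1 == t0);
    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

If the reported tick is 10 ms or more, timing anything shorter than a few hundred milliseconds with clock() is hopeless.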

Norbert, your Linux/OSX timing harness can be slightly improved in accuracy. The gettimeofday() function returns the number of seconds and microseconds since the 1970 start epoch. This means that a measurement now, 44 years later, has a microsecond counter value of about 44 years * 365 days * 24 hours * 3600 seconds * 1000000 microseconds, which is about 2^50.4. This is close enough to the 53 effective bits of mantissa in a double that you get precision-limited representation errors when differencing the start and end times over very short (on the order of a microsecond) intervals.
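To put a number on it: near a 2014-era Unix timestamp of about 1.4e9 seconds, one ULP (the gap between adjacent doubles) works out to roughly a quarter of a microsecond. Here is a small check, with the ULP found by simple halving rather than any library call:

```c
/* Find the spacing between adjacent doubles near x by halving a
   candidate gap until adding it no longer changes x. The volatile
   store keeps the comparison in true double precision. */
static double ulp_near(double x)
{
    volatile double probe;
    double gap = 1.0;
    for (;;) {
        probe = x + gap / 2.0;
        if (probe == x)
            break;
        gap /= 2.0;
    }
    return (x + gap) - x;
}
```

ulp_near(1.4e9) comes out to 2^-22 seconds, about 0.24 us, so sub-microsecond differences of absolute timestamps are right at the edge of what a double can represent.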

A quick fix is to offset the epoch to effectively make it zero centered in say year 2015. This leaves enough mantissa bits for another decade or so.
Just update the return statement of the gettimeofday() branch in your code to be:

const time_t Epoch2015 = 45UL*365*24*3600; 
  return (double)(tv.tv_sec-Epoch2015) + (double)tv.tv_usec / 1000000.0;

Looking a bit deeper, the real problem is that while the gettimeofday() call is returning an integer number of microseconds, that count is being divided by 1000000.0, which makes it not exactly representable in floating point. This is usually ignorable except we only have a few bits of mantissa left because of that 50+ bit epoch offset. So an alternative and perhaps superior fix would be to store the number of microseconds, not seconds, and difference those and do the 1000000.0 division on the difference.

Yet another layer of improvement could come by using the finer grained clock in Linux:

#include <time.h>

double newsecond(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1000000000.0;
}

This timer uses an epoch based on the machine boot time, and returns nanoseconds, not microseconds. In practice in x86-64 Linux the timer quantum seems to be about 0.05 us, giving a lot finer results as well. clock_gettime() is a bit more annoying in that it needs the librt library to be linked and is not available on all flavors of UNIX. gettimeofday() is POSIX so it’s extremely portable.

Those are all excellent points one may want to consider if more than microsecond resolution is needed. Not sure whether this is practical. I find that one usually comes up against various sources of noise in modern systems, such as the memory hierarchy. Sub-microsecond timing would also require calibration to subtract out the overhead of the OS functions used to report the time.
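The calibration step can be as simple as timing a batch of back-to-back timer calls; this is a sketch, and timer_overhead() is just an illustrative name:

```c
#include <stddef.h>
#include <sys/time.h>

/* Linux/OSX branch of the timer from earlier in the thread. */
static double second(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1000000.0;
}

/* Average cost of one timer call, estimated over a batch; subtract
   this from very short measurements. */
static double timer_overhead(int calls)
{
    double start = second();
    for (int i = 0; i < calls; i++)
        (void)second();
    double stop = second();
    return (stop - start) / (double)calls;
}
```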

When I started using the posted code a long time ago, I convinced myself that microsecond resolution could be maintained within “double” data until Unix time stamps roll over in 2038, requiring no more than 51 bits and leaving two bits to represent fractional (quarter) microseconds. Thinking about it now, I am not sure whether there actually is a connection between the “Year 2038” problem and the gettimeofday() function; it has been too long since I last looked at this functionality.
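For what it is worth, that bit count is easy to verify mechanically: a 32-bit signed time_t rolls over after 2^31 seconds, which is 2^31 * 10^6 microseconds. Counting the bits (bits_needed() is just a helper for this check):

```c
/* Number of bits needed to represent v. */
static int bits_needed(unsigned long long v)
{
    int bits = 0;
    while (v != 0ULL) {
        bits++;
        v >>= 1;
    }
    return bits;
}
```

bits_needed(2147483648ULL * 1000000ULL) is 51, which indeed leaves two of a double's 53 effective mantissa bits for quarter-microsecond fractions.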