Timing a CUDA program on an HPC cluster

Hello all,

I’m wondering what the best way is to time my program on an HPC cluster using a single GPU. On my laptop I’ve been timing it as follows:

clock_t start = clock();

clock_t end;

float elapsed;

while (some_condition)
{
    // block of CUDA code
    ...
    end = clock();
    elapsed = (float)(end - start) / CLOCKS_PER_SEC;

    // print elapsed time to file
    ...
}

// one final round of measurement
end = clock();
elapsed = (float)(end - start) / CLOCKS_PER_SEC;

I need to measure the cumulative runtime inside the while loop as well. Would this same method work on a cluster?

Thank you.

It all depends on what it is you want to measure. But in general you would want some sort of high-resolution timer for ease of measurement. clock() is not a timing facility with that property: its effective granularity is implementation-defined and often coarse, and on POSIX systems it measures the CPU time consumed by your process rather than elapsed wall-clock time.

If you are after elapsed wall-clock time, and microsecond resolution is sufficient, you could use the following code (based on high-resolution operating system timing facilities) that I have been using for the past twenty years.

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
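/* second(): return elapsed wall-clock time in seconds, using the
   high-resolution performance counter when one is available and
   falling back to GetTickCount() (millisecond resolution) otherwise */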
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
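/* second(): return elapsed wall-clock time in seconds with
   microsecond resolution, based on gettimeofday() */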
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
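One GPU-specific caveat worth adding: kernel launches are asynchronous with respect to the host, so a host-side timer such as second() only captures the GPU work if the device is synchronized before the second reading. Below is a minimal sketch of that pattern; the kernel, its launch configuration, and the buffer size are placeholder assumptions, not something taken from this thread.

#include <stdio.h>
#include <cuda_runtime.h>

double second (void);   /* the timer function from the listing above */

__global__ void my_kernel (float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main (void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc ((void **)&d_data, n * sizeof (float));
    cudaMemset (d_data, 0, n * sizeof (float));

    double start = second();
    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();   /* wait for the GPU before taking the end reading */
    double elapsed = second() - start;
    printf ("elapsed wall-clock time: %.6f s\n", elapsed);

    cudaFree (d_data);
    return 0;
}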

Hi, yes, I am after elapsed wall-clock time; thank you for your code.

I just had one further doubt about your code, namely about gettimeofday. If I am running my code on a cluster, might it be the case that my job does not run to completion in one go, but instead gets suspended and rescheduled at different times of the day, for example? In that case, wouldn’t gettimeofday give erroneous results?

I have no idea how you are running apps on your cluster, or what exactly you are hoping to measure.

Maybe you need to measure how long each program instance executes on a node, because you are interested in finding out how much run-time variation there is between nodes, and whether there are patterns of slow and fast nodes. Maybe you simply want to know the end-to-end time from when the control node kicks off the first instance to the completion of the last instance.

I want to measure how long my program takes to run on a single node. Workloads are scheduled using Slurm.

Most likely, this is not a valid concern, and I would encourage you to give the advice a try. While what you are describing is theoretically possible, it would (to make any sense at all) require a checkpointing mechanism, which is a fairly advanced use case. If none of this means anything to you, then you are certainly not using checkpointing, your program will indeed run on a single node “in one go”, and the concern you are imagining does not apply.


In that case I cannot see a reason why second() wouldn’t work for you. Stick one call at the start of your program, one at the end of your program, compute the difference, and report this result in whatever form is suitable for your purposes and environment. That will result in a collection of as many measurements of elapsed time as there were program instances. You can then process that data to generate whatever statistics you desire.

You can likewise measure elapsed time for various parts of the program and report those separately.
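Applied to the loop from the original question, the same idea might look like the following sketch, with the clock() calls replaced by second() and a device synchronization before each reading so that asynchronous GPU work is actually included in the measurement:

double start = second();
double elapsed;

while (some_condition)
{
    // block of CUDA code
    ...
    cudaDeviceSynchronize();        // make sure queued GPU work has finished
    elapsed = second() - start;     // cumulative wall-clock time so far

    // print elapsed time to file
    ...
}

// one final round of measurement
cudaDeviceSynchronize();
elapsed = second() - start;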


I will give the advice a try. My query actually arose from this post, namely the statements below, ignoring that it is about Fortran and extending the idea to GPUs:

These two intrinsics report different types of time. system_clock reports “wall time” or elapsed time. cpu_time reports time used by the CPU. On a multi-tasking machine these could be very different.

I hypothesised that using wall clock time would therefore give erroneous results.

Funny you should mention Fortran, because the naming and interface design of my function second() stems from its original use with Fortran code. This was before there were standardized timing functions in Fortran.

Yes, there are differences between measuring wall clock time and CPU time. In extreme cases, the difference can be very large. Just the other day I was looking at a program with about 10 hours of elapsed wall-clock time, but only about 1.5 hours of CPU time. The difference was attributable to heavy I/O activity.

To my knowledge, gettimeofday() returns wall-clock time, and that is why I stated in my initial post “if you are after elapsed wall-clock time, and microsecond resolution is sufficient, you could use the following code”.
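If you want to see that distinction in a stand-alone experiment (my own illustration, not something from this exchange), the following Linux program sleeps for two seconds and does essentially no work: clock() reports close to zero CPU time, while gettimeofday() reports roughly two seconds of elapsed time.

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/time.h>

int main (void)
{
    struct timeval tv0, tv1;
    clock_t c0 = clock();
    gettimeofday (&tv0, NULL);

    sleep (2);    /* elapses wall-clock time without consuming CPU time */

    clock_t c1 = clock();
    gettimeofday (&tv1, NULL);

    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    double wall = (double)(tv1.tv_sec - tv0.tv_sec)
                + (double)(tv1.tv_usec - tv0.tv_usec) * 1.0e-6;
    printf ("CPU time : %.3f s\n", cpu);    /* close to 0 */
    printf ("wall time: %.3f s\n", wall);   /* close to 2 */
    return 0;
}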

Fair warning: Wall-clock time may be subject to discontinuity or non-monotonicity, for example due to the system operator adjusting the system time, or possibly (not sure, I have not tried it) across the start point or end point of daylight saving time. If you expect situations like that to be an issue, you might want to investigate other timing facilities, for example clock_gettime(), which has a different set of caveats.
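For example, a monotonic variant of second() could be built on clock_gettime() with CLOCK_MONOTONIC. The sketch below assumes Linux, is not part of the code posted above (the name second_monotonic is just a placeholder), and on older glibc versions may need to be linked with -lrt:

#include <time.h>

double second_monotonic (void)
{
    struct timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);   /* not affected by system time adjustments */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}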

The following question (with answers :-) on Stack Overflow may be helpful:

How can I measure CPU time and wall clock time on both Linux/Windows?


helpful pointers, cheers, i’ll try not to off myself :)