Timing a CUDA program on an HPC cluster

Hello all,

I’m wondering what the best way is to time my program on an HPC cluster using a single GPU. On my laptop I’ve been timing it as follows:

clock_t start = clock();

clock_t end;

float elapsed;

while (some_condition)
{
    // block of CUDA code
    ...
    end = clock();
    elapsed = (float)(end - start) / CLOCKS_PER_SEC;

    // print elapsed time to file
    ...
}

// one final round of measurement
end = clock();
elapsed = (float)(end - start) / CLOCKS_PER_SEC;

I need to measure the cumulative runtime inside the while loop as well. Would this same method work on a cluster?

Thank you.

It all depends on what it is you want to measure. But in general you would want some sort of high-resolution timer for ease of measurement. clock() is not a timing facility with that property: its effective granularity is implementation-defined and often coarse, and on POSIX systems it measures the CPU time consumed by your process rather than elapsed wall-clock time.

If you are after elapsed wall-clock time, and microsecond resolution is sufficient, you could use the following code (based on high-resolution operating system timing facilities) that I have been using for the past twenty years.

#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
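/* second(): return elapsed wall-clock time in seconds, using the
   high-resolution performance counter when one is available and
   falling back to GetTickCount() (millisecond resolution) otherwise */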
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
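/* second(): return elapsed wall-clock time in seconds with
   microsecond resolution, based on gettimeofday() */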
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
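One GPU-specific caveat worth adding: kernel launches are asynchronous with respect to the host, so a host-side timer such as second() only captures the GPU work if the device is synchronized before the second reading. Below is a minimal sketch of that pattern; the kernel, its launch configuration, and the buffer size are placeholder assumptions, not something taken from this thread.

#include <stdio.h>
#include <cuda_runtime.h>

double second (void);   /* the timer function from the listing above */

__global__ void my_kernel (float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main (void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc ((void **)&d_data, n * sizeof (float));
    cudaMemset (d_data, 0, n * sizeof (float));

    double start = second();
    my_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();   /* wait for the GPU before taking the end reading */
    double elapsed = second() - start;
    printf ("elapsed wall-clock time: %.6f s\n", elapsed);

    cudaFree (d_data);
    return 0;
}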

Hi, yes, I am after elapsed wall-clock time; thank you for your code.

I just had one further doubt about your code, namely about gettimeofday. If I am running my code on a cluster, might it be the case that my job does not run to completion in one go, but instead gets suspended and rescheduled at different times of the day, for example? In that case, wouldn’t gettimeofday give erroneous results?

I have no idea how you are running apps on your cluster, or what exactly you are hoping to measure.

Maybe you need to measure how long each program instance executes on a node, because you are interested in finding out how much run-time variation there is between nodes, and whether there are patterns of slow and fast nodes. Maybe you simply want to know the end-to-end time from when the control node kicks off the first instance to the completion of the last instance.

I want to measure how long my program takes to run on a single node. Workloads are scheduled using Slurm.

Most likely, this is not a valid concern, and I would encourage you to give the advice a try. While what you are describing is theoretically possible, it would (to make any sense at all) require a checkpointing mechanism, which is a fairly advanced use case. If none of this means anything to you, then you are certainly not using checkpointing, your program will indeed run on a single node “in one go”, and the concern you are imagining does not apply.


In that case I cannot see a reason why second() wouldn’t work for you. Stick one call at the start of your program, one at the end of your program, compute the difference, and report this result in whatever form is suitable for your purposes and environment. That will result in a collection of as many measurements of elapsed time as there were program instances. You can then process that data to generate whatever statistics you desire.

You can likewise measure elapsed time for various parts of the program and report those separately.
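Applied to the loop from the original question, the same idea might look like the following sketch, with the clock() calls replaced by second() and a device synchronization before each reading so that asynchronous GPU work is actually included in the measurement:

double start = second();
double elapsed;

while (some_condition)
{
    // block of CUDA code
    ...
    cudaDeviceSynchronize();        // make sure queued GPU work has finished
    elapsed = second() - start;     // cumulative wall-clock time so far

    // print elapsed time to file
    ...
}

// one final round of measurement
cudaDeviceSynchronize();
elapsed = second() - start;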


I will give the advice a try. My query actually arose from this post, namely the statements below, ignoring that it is about Fortran and extending the idea to GPUs:

These two intrinsics report different types of time. system_clock reports “wall time” or elapsed time. cpu_time reports time used by the CPU. On a multi-tasking machine these could be very different.

I hypothesised that using wall clock time would therefore give erroneous results.

Funny you should mention Fortran, because the naming and interface design of my function second() stems from its original use with Fortran code. This was before there were standardized timing functions in Fortran.

Yes, there are differences between measuring wall clock time and CPU time. In extreme cases, the difference can be very large. Just the other day I was looking at a program with about 10 hours of elapsed wall-clock time, but only about 1.5 hours of CPU time. The difference was attributable to heavy I/O activity.

To my knowledge, gettimeofday() returns wall-clock time, and that is why I stated in my initial post “if you are after elapsed wall-clock time, and microsecond resolution is sufficient, you could use the following code”.
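If you want to see that distinction in a stand-alone experiment (my own illustration, not something from this exchange), the following Linux program sleeps for two seconds and does essentially no work: clock() reports close to zero CPU time, while gettimeofday() reports roughly two seconds of elapsed time.

#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/time.h>

int main (void)
{
    struct timeval tv0, tv1;
    clock_t c0 = clock();
    gettimeofday (&tv0, NULL);

    sleep (2);    /* elapses wall-clock time without consuming CPU time */

    clock_t c1 = clock();
    gettimeofday (&tv1, NULL);

    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    double wall = (double)(tv1.tv_sec - tv0.tv_sec)
                + (double)(tv1.tv_usec - tv0.tv_usec) * 1.0e-6;
    printf ("CPU time : %.3f s\n", cpu);    /* close to 0 */
    printf ("wall time: %.3f s\n", wall);   /* close to 2 */
    return 0;
}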

Fair warning: Wall-clock time may be subject to discontinuity or non-monotonicity, for example due to the system operator adjusting the system time, or possibly (not sure, I have not tried it) across the start point or end point of daylight saving time. If you expect situations like that to be an issue, you might want to investigate other timing facilities, for example clock_gettime(), which has a different set of caveats.
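For example, a monotonic variant of second() could be built on clock_gettime() with CLOCK_MONOTONIC. The sketch below assumes Linux, is not part of the code posted above (the name second_monotonic is just a placeholder), and on older glibc versions may need to be linked with -lrt:

#include <time.h>

double second_monotonic (void)
{
    struct timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);   /* not affected by system time adjustments */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1.0e-9;
}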

The following question (with answers :-) on Stack Overflow may be helpful:

How can I measure CPU time and wall clock time on both Linux/Windows?


helpful pointers, cheers, i’ll try not to off myself :)