High CPU load and bad clock() return values


Lots of articles explain that during GPU operation, CPU load on one thread is high (e.g. 100%) due to spin-wait polling. The following command should reduce this:


This works fine, but I use clock() inside my host code to measure performance once the kernel has finished. However, clock() now returns erratic (far too low) tick counts.

Why is this?

I don’t see why reducing the CPU/GPU poll frequency should interfere with the CPU’s tick count.

I generally don’t use clock(), as in my experience it has a number of issues, such as too-coarse granularity and differing behavior across operating systems.

The reason the CPU load drops in your case is that, under the hood, using cudaSetDeviceFlags can allow CPU threads to “yield” to other OS threads. I wouldn’t be surprised that this affects thread-based timing, because the thread has in fact entered a kind of sleep state (therefore the number of clock ticks witnessed by that thread is reduced). In my experience this behavior also affects other thread-based monitoring, such as the Linux time command (user time, not wall time). Stated another way, the amount of time a thread is “active” is reduced when it enters a yield status, as compared to wall clock time.

My suggestion would be to use another timer, such as std::chrono, a high-resolution timer on Windows, or a HRT on Linux.

Referring to the Linux man page for clock():

" The clock() function returns an approximation of processor time used
by the program."

Processor time used by the program is not the same as wall clock, or elapsed, time. When a thread is sleeping, or in a yielded state, it is not using processor time, even though wall clock time continues to advance. So if you are really after wall clock/elapsed/duration time measurements, using the clock() API is probably a mistake.


clock() wasn’t the proper choice.
After digging further, I decided to use this:

#include <sys/time.h>

/* Milliseconds since the Unix epoch; intended for taking differences
   between two calls. Note: tv_sec * 1000 can overflow where long is
   only 32 bits wide. */
long timevalms() {
    struct timeval tv;
    long ms;
    gettimeofday(&tv, NULL);
    ms = tv.tv_sec * 1000 + tv.tv_usec / 1000;
    return ms;
}
This works perfectly now.


I have been using the following cross-platform code for the past 15+ years. Since, for reasons of convenience, it uses a double to represent the number of seconds, it offers slightly better than microsecond resolution even beyond January 19, 2038, when the 32-bit Unix timestamp overflows: a double carries a 53-bit significand, so at epoch values around 2^31 seconds the least significant bit still represents roughly half a microsecond. Higher resolution is rarely needed in my experience.

// Microsecond resolution
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif