How to correctly measure performance of cudaMalloc(...)?


I have read many posts about measuring the performance of kernels and memcpy, but none about cudaMalloc(…) (or perhaps my search skills are limited). So I wonder how to correctly measure cudaMalloc(…) performance: should I use normal clock() calls or CUDA events?


cudaMalloc() is code that runs on the host, and its performance depends on the operating system and the single-thread performance of the CPU. Like other host-side activity, it is best measured with a high-resolution system timer such as gettimeofday() on Linux. I have used the code below for the past 15+ years; it provides microsecond granularity.

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
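For concreteness, here is a minimal sketch of how the timer above might wrap the call being measured. Since cudaMalloc() needs a GPU, plain malloc() stands in for it here so the sketch compiles and runs anywhere; the helper name time_alloc() is my own invention, not from the thread. Note that in real use the first CUDA API call in a process also pays the one-time cost of context creation, so it is worth issuing a warm-up allocation before the timed one.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/time.h>

/* Linux/macOS branch of the second() timer from above. */
static double second (void)
{
    struct timeval tv;
    gettimeofday (&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* Time one allocation of the given size; returns elapsed seconds,
   or -1.0 if the allocation failed. malloc() is a stand-in; for the
   real measurement, substitute cudaMalloc() and check its status. */
double time_alloc (size_t bytes)
{
    double start, stop;
    void *p;

    p = malloc (bytes);   /* warm-up; with CUDA this absorbs context init */
    free (p);

    start = second ();
    p = malloc (bytes);   /* the timed call; replace with cudaMalloc() */
    stop = second ();
    if (p == NULL) return -1.0;
    free (p);
    return stop - start;
}
```

The same pattern works for any host-side CUDA API call: read the timer immediately before and after the call, and subtract.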

Thanks njuffa, it works pretty well in my code. Your code actually helped me verify the cudaMalloc(…) timing results I reported in a different blog post.