How to correctly measure performance of cudaMalloc(...)?

Hi,

I have read many posts about measuring the performance of kernels and memcpy, but none about cudaMalloc(…) (or perhaps my search skills are limited). How do I correctly measure cudaMalloc(…) performance? Should I use ordinary clock() calls or CUDA events?

Thanks,
Ngoc

cudaMalloc() is code that runs on the host, and its performance depends on the operating system and the single-thread performance of the CPU. Like other host activity, it is best measured using a high-resolution system timer such as gettimeofday() on Linux. I have used the code below for the past 15+ years. It provides microsecond granularity.

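// A routine to give access to a high precision timer on most systems.
// Returns a wall-clock timestamp in seconds; subtract two readings to get
// the elapsed time between them.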
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
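
As a rough illustration of how second() might be used for this, here is a minimal sketch rather than a definitive benchmark. The helper time_min() and the stand-in workload dummy_work() are hypothetical names that are not part of the code above. The POSIX form of second() is repeated so the sketch compiles on its own, and the workload is a plain loop so that it builds without the CUDA toolkit.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* POSIX form of the second() routine above, repeated so that this
   sketch compiles on its own (Linux/macOS only). */
static double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* Hypothetical helper: calls fn(arg) once untimed as a warm-up, then
   times it reps times and returns the fastest run in seconds. */
static double time_min (void (*fn)(void *), void *arg, int reps)
{
    double best = -1.0;
    int i;

    assert (reps > 0);
    fn (arg);                           /* warm-up call, not timed */
    for (i = 0; i < reps; i++) {
        double start = second();
        double elapsed;
        fn (arg);
        elapsed = second() - start;
        if (best < 0.0 || elapsed < best) {
            best = elapsed;             /* keep the fastest run */
        }
    }
    return best;
}

/* Stand-in workload (an assumption for illustration only). In a real
   measurement the body would call cudaMalloc(), and probably
   cudaFree() so that repeated runs do not exhaust device memory. */
static void dummy_work (void *arg)
{
    volatile unsigned long *counter = (volatile unsigned long *)arg;
    unsigned long i;

    for (i = 0; i < 100000UL; i++) {
        (*counter)++;
    }
}
```

A call such as time_min(dummy_work, &counter, 10) returns the fastest of ten timed runs. The untimed warm-up call is there because the first CUDA runtime call in a process typically also pays for context initialization, which would otherwise be attributed to cudaMalloc(). Taking the minimum is one choice among several; a median or mean may suit other purposes better. Note that if the timed function calls both cudaMalloc() and cudaFree(), the result covers the pair, not the allocation alone.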

Thanks njuffa, it works pretty well in my code. In fact, your code helped me verify the cudaMalloc(…) timing results that I reported in a different post: https://devtalk.nvidia.com/default/topic/831150/cuda-programming-and-performance/titan-x-with-latest-drivers-slower-than-titan-black-with-older-drivers/2/?offset=24#4676896