How to correctly measure performance of cudaMalloc(...)?


I have read many posts about measuring the performance of kernels and memcpy, but none about cudaMalloc(…) (or perhaps my search skills are limited). So I wonder how to correctly measure cudaMalloc(…) performance: should I use normal clock() calls or CUDA events?


cudaMalloc() is code that runs on the host, and its performance depends on the operating system and the single-thread performance of the CPU. Like other host-side activity, it is best measured with a high-resolution system timer such as gettimeofday() on Linux. I have used the code below for the past 15+ years; it provides microsecond granularity.

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
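For concreteness, here is a minimal sketch of how the timer above might wrap the call being measured. Since cudaMalloc() needs a GPU, plain malloc() stands in for it here so the sketch compiles and runs anywhere; the helper name time_alloc() is my own invention, not from the thread. Note that in real use the first CUDA API call in a process also pays the one-time cost of context creation, so it is worth issuing a warm-up allocation before the timed one.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/time.h>

/* Linux/macOS branch of the second() timer from above. */
static double second (void)
{
    struct timeval tv;
    gettimeofday (&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* Time one allocation of the given size; returns elapsed seconds,
   or -1.0 if the allocation failed. malloc() is a stand-in; for the
   real measurement, substitute cudaMalloc() and check its status. */
double time_alloc (size_t bytes)
{
    double start, stop;
    void *p;

    p = malloc (bytes);   /* warm-up; with CUDA this absorbs context init */
    free (p);

    start = second ();
    p = malloc (bytes);   /* the timed call; replace with cudaMalloc() */
    stop = second ();
    if (p == NULL) return -1.0;
    free (p);
    return stop - start;
}
```

The same pattern works for any host-side CUDA API call: read the timer immediately before and after the call, and subtract.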

Thanks njuffa, it works pretty well in my code. Your code actually helped me verify the cudaMalloc(…) timing results I reported in a different blog post.