Timing in OpenCL vs. CUDA
How to verify that I'm measuring the same thing in OpenCL and CUDA?

Hi everyone,

I’ve been writing some microbenchmarks to get used to OpenCL (I’ve used CUDA before, but I’m new to OpenCL). However, I haven’t been able to figure out how to measure timing (of a kernel, for example) on an apples-to-apples basis with CUDA, and I can’t find any previous posts about this, so I’m posting here to see if someone can clarify it for me. The example I’m going to use is an empty kernel (i.e. it just returns immediately).

My question is how to measure the “same thing” in OpenCL and CUDA? For example, if I write this empty kernel in CUDA it would be (ignoring error checking, etc., which I am doing in my actual code):

// CUDA version
__global__ void empty()
{
}

// measures the kernel execution time
void measure_empty()
{
  cudaEvent_t start, stop;
  float elapsedTime = 0.0f;
  cudaError_t cudaErr;

  ... // initialization

  dim3 Db = dim3(512, 1, 1);
  dim3 Dg = dim3(16, 16, 16);

  // start timing
  cudaErr = cudaEventRecord(start, 0);

  empty<<<Dg, Db>>>();

  // stop timing: elapsedTime = stop - start, in milliseconds
  cudaErr = cudaEventRecord(stop, 0);
  cudaErr = cudaEventSynchronize(stop);
  cudaErr = cudaEventElapsedTime(&elapsedTime, start, stop);
}


To write the same (or what I believe to be the same) code in OpenCL, I did this (again, ignore online building, creating the command queue, etc., which I am doing but am not showing to save space):

// global variables
cl_command_queue commandQueue; // must be created with CL_QUEUE_PROFILING_ENABLE for profiling info to be available
cl_kernel empty;

// kernel source (built online)
__kernel void empty()
{
}

void measure_empty()
{
  cl_int clErr = CL_SUCCESS;
  cl_event GPUExecution;

  size_t localWorkSize[3] = {512, 1, 1};
  size_t globalWorkSize[3] = {8192, 16, 16}; // 16 x 16 x 16 work-groups of 512 work-items

  // call kernel, flush so it executes without waiting for more commands to be enqueued
  clErr = clEnqueueNDRangeKernel(commandQueue, empty, 3, NULL, globalWorkSize, localWorkSize, 0, NULL, &GPUExecution);
  clErr = clFlush(commandQueue);

  // synchronize on return
  clErr = clFinish(commandQueue);

  // get timing (profiling timestamps are in nanoseconds)
  cl_ulong start, end;
  // the event is passed by value here, not by address
  clErr = clGetEventProfilingInfo(GPUExecution, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
  clErr = clGetEventProfilingInfo(GPUExecution, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);

  double elapsedTime = (double)1.0e-9 * (end - start); // seconds
}


So I’ve run both of these, and I don’t seem to be able to get the same timing (by same timing, I mean something that makes sense / is reasonable). Here’s what I’m getting:

CUDA: Kernel Processing Time: 0.094912 ms

OpenCL: Kernel proccesing time: 0.00010 s

Any ideas what I should be doing differently? One thought I had is to also query clGetEventProfilingInfo for how long the command sits in the queue before it starts (CL_PROFILING_COMMAND_QUEUED and CL_PROFILING_COMMAND_SUBMIT), but I’m not sure this is the right way to go. Any help would be greatly appreciated!
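To make that idea concrete, here is a minimal sketch of how the four profiling timestamps could be split into intervals. It is not something I have validated against CUDA. The struct and function names are hypothetical, and uint64_t stands in for cl_ulong so the snippet compiles without the OpenCL headers. In real code the four values would come from clGetEventProfilingInfo with CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the four OpenCL profiling timestamps,
   all in nanoseconds (uint64_t stands in for cl_ulong). */
typedef struct {
    uint64_t queued;  /* CL_PROFILING_COMMAND_QUEUED */
    uint64_t submit;  /* CL_PROFILING_COMMAND_SUBMIT */
    uint64_t start;   /* CL_PROFILING_COMMAND_START  */
    uint64_t end;     /* CL_PROFILING_COMMAND_END    */
} event_times_ns;

/* Hypothetical result: each interval in milliseconds. */
typedef struct {
    double queue_wait_ms;   /* queued -> submit */
    double submit_wait_ms;  /* submit -> start  */
    double exec_ms;         /* start  -> end    */
    double total_ms;        /* queued -> end    */
} event_breakdown_ms;

/* Convert a nanosecond interval to milliseconds. */
double interval_ms(uint64_t from_ns, uint64_t to_ns)
{
    assert(to_ns >= from_ns);
    return (double)(to_ns - from_ns) * 1.0e-6;
}

/* Split one event's timestamps into its three phases plus the total. */
event_breakdown_ms break_down(event_times_ns t)
{
    event_breakdown_ms b;
    b.queue_wait_ms  = interval_ms(t.queued, t.submit);
    b.submit_wait_ms = interval_ms(t.submit, t.start);
    b.exec_ms        = interval_ms(t.start,  t.end);
    b.total_ms       = interval_ms(t.queued, t.end);
    return b;
}
```

Which of these intervals is the fairest match for what the cudaEvent pair measures is exactly the part I’m unsure about; exec_ms is the one my code above reports.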



So it would appear that my problem is just unit conversion: the OpenCL figure above is in seconds, while the CUDA one is in milliseconds. Doh!
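Spelling the conversion out as a small sketch: OpenCL profiling timestamps are in nanoseconds, while cudaEventElapsedTime reports milliseconds, so scaling the OpenCL difference by 1.0e-6 instead of 1.0e-9 puts both numbers in the same unit. The function name is hypothetical and uint64_t stands in for cl_ulong so that it compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an OpenCL start/end timestamp pair (nanoseconds) to
   milliseconds, the unit that cudaEventElapsedTime reports. */
double profiling_ns_to_ms(uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    return (double)(end_ns - start_ns) * 1.0e-6;
}
```

With this, the 0.00010 s I printed for OpenCL (100000 ns) comes out as roughly 0.10 ms, which is in the same range as the 0.094912 ms from CUDA.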

My overall question still remains, though: is this the “correct” way to measure logically equivalent code?