Hi everyone,
I’ve been writing some microbenchmarks just to get used to OpenCL (I’ve used CUDA before, but I’m new to OpenCL). However, I haven’t been able to figure out how to accurately measure timing (of a kernel, for example) on an apples-to-apples basis with CUDA, and I can’t find any previous posts about this, so I’m posting here in the hope that someone can clarify it for me. The example I’m going to use is just an empty kernel (i.e. it returns immediately).
My question is: how do I measure the “same thing” in OpenCL and CUDA? For example, here is the empty kernel and its timing code in CUDA (ignoring error checking, etc., which I do in my actual code):
// CUDA version
__global__ void empty()
{
    return;
}

// measures the kernel execution time
void measure_empty()
{
    cudaEvent_t start, stop;
    float elapsedTime = 0.0f;
    cudaError_t cudaErr;
    ... // initialization
    dim3 Db = dim3(512, 1, 1);
    dim3 Dg = dim3(16, 16, 16);
    // start timing
    cudaErr = cudaEventRecord(start, 0);
    empty<<<Dg, Db>>>();
    cudaThreadSynchronize();
    // stop timing: stop - start = elapsedTime
    cudaErr = cudaEventRecord(stop, 0);
    cudaErr = cudaEventSynchronize(stop);
    cudaErr = cudaEventElapsedTime(&elapsedTime, start, stop);
}
To write the same (or what I believe to be the same) code in OpenCL, I did the following. Again, I’m omitting the online build, command-queue creation, etc. to save space (a rough sketch of that setup is included after the code). Note that the global work size of {8192, 16, 16} with a local size of {512, 1, 1} launches the same 2,097,152 work-items as the 16×16×16 grid of 512-thread blocks in the CUDA version above:
// global variables (created during setup, not shown here)
cl_command_queue commandQueue;
cl_kernel empty;

// OpenCL kernel source (built online at run time)
__kernel void empty()
{
    return;
}
void measure_empty()
{
    cl_int clErr = CL_SUCCESS;
    cl_event GPUExecution;
    size_t localWorkSize[3]  = {512, 1, 1};
    size_t globalWorkSize[3] = {8192, 16, 16};
    // call kernel, then flush so it executes without waiting for more commands to be enqueued
    clErr = clEnqueueNDRangeKernel(commandQueue, empty, 3, NULL, globalWorkSize, localWorkSize, 0, NULL, &GPUExecution);
    clErr = clFlush(commandQueue);
    // synchronize on return
    clErr = clFinish(commandQueue);
    // get timing (the profiling counters are in nanoseconds)
    cl_ulong start, end;
    clErr = clGetEventProfilingInfo(GPUExecution, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
    clErr = clGetEventProfilingInfo(GPUExecution, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
    double elapsedTime = (double)1.0e-9 * (end - start);   // seconds
}
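For completeness, since (as far as I understand) clGetEventProfilingInfo only returns valid timestamps when the queue is created with profiling enabled, here is a rough sketch of the setup I’m omitting above. The context, device, and kernelSource names are just placeholders for my actual handles and source string:

// rough setup sketch (context, device, kernelSource are placeholders for my real ones)
cl_int clErr = CL_SUCCESS;
// command queue with profiling enabled, so clGetEventProfilingInfo works
commandQueue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &clErr);
// online build of the kernel source, then get the kernel object
cl_program program = clCreateProgramWithSource(context, 1, &kernelSource, NULL, &clErr);
clErr = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
empty = clCreateKernel(program, "empty", &clErr);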
So I’ve run both of these, but I can’t seem to get matching timings (by “matching” I mean something that makes sense / is reasonable). Here’s what I’m getting:
CUDA: Kernel Processing Time: 0.094912 ms
OpenCL: Kernel processing time: 0.00010 s
Any ideas what I should be doing differently? One thought I had was to also query clGetEventProfilingInfo for how long the kernel sits enqueued before it actually starts executing, but I’m not sure this is the right way to go.
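Concretely, this is the kind of thing I had in mind (just a sketch; it assumes the GPUExecution event from measure_empty() above is still live, i.e. it would go right after the existing profiling calls, and it needs <stdio.h> for the printf):

// query all four profiling timestamps for the kernel event (all in nanoseconds)
cl_ulong queued, submitted, started, ended;
clGetEventProfilingInfo(GPUExecution, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(GPUExecution, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &submitted, NULL);
clGetEventProfilingInfo(GPUExecution, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &started, NULL);
clGetEventProfilingInfo(GPUExecution, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ended, NULL);

// convert to milliseconds so the numbers are directly comparable with cudaEventElapsedTime()
double queueToStartMs = 1.0e-6 * (double)(started - queued);   // time spent queued/submitted before execution
double executionMs = 1.0e-6 * (double)(ended - started);       // actual kernel execution time
printf("queued->start: %f ms, start->end: %f ms\n", queueToStartMs, executionMs);

Any help would be greatly appreciated!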
Thanks,
Matt