Hi. I’m seeing unexpected results when comparing kernel execution times in CUDA and OpenCL. From what I’ve read, CUDA should be slightly faster than OpenCL, yet a simple add kernel (adding two 1D arrays) runs faster in OpenCL. The code:
CUDA - Kernel
__global__ void addKernel(int *c, const int *a, const int *b, int size)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < size) {
        c[i] = a[i] + b[i];
    }
}
CUDA - Execution
addKernel <<<62500, 1024 >>>(dev_c, dev_a, dev_b, size);
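The grid size of 62500 is the array size divided by the block size, rounded up. A minimal host-side sketch of that calculation follows; the helper name `gridSizeFor` is hypothetical and not part of the CUDA API:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of blocks needed so that
// blocks * threadsPerBlock >= size (ceiling division).
inline std::size_t gridSizeFor(std::size_t size, std::size_t threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (size + threadsPerBlock - 1) / threadsPerBlock;
}
```

For size = 64000000 and 1024 threads per block, this gives the 62500 blocks used in the launch above. When the size is not a multiple of the block size, the last block is only partly used, which is what the `if (i < size)` check in the kernel guards against.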
OpenCL - Kernel
__kernel void Add(__global int *a, __global int *b, __global int *c, int size) {
    int i = get_global_id(0);
    if (i < size) {
        c[i] = a[i] + b[i];
    }
}
OpenCL - Execution
vectorSize=64000000; localWorkSize=1024
error = clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, &vectorSize, &localWorkSize, 0, NULL, &event);
Array size in both cases is 64000000. Time is measured with the built-in timing functions of CUDA and OpenCL.
Execution times are: CUDA - 8-9 ms, OpenCL - ~5 ms
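To put these timings in perspective, they can be converted to effective memory bandwidth, since the kernel reads two int arrays and writes one. Below is a rough host-side sketch, assuming 4-byte ints and ignoring caching effects; the function name `effectiveBandwidthGBs` is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: effective bandwidth in GB/s (1 GB = 1e9 bytes)
// for an element-wise kernel that reads two 32-bit int arrays and writes one.
inline double effectiveBandwidthGBs(std::size_t elements, double milliseconds) {
    assert(milliseconds > 0.0);
    // Three arrays are touched once each: a and b are read, c is written.
    const double bytesMoved =
        3.0 * static_cast<double>(elements) * sizeof(std::int32_t);
    const double seconds = milliseconds * 1e-3;
    return bytesMoved / seconds / 1e9;
}
```

With 64000000 elements, this gives about 153.6 GB/s for 5 ms and roughly 90 GB/s for 8.5 ms, which can be compared against the card's memory bandwidth.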
The GPU is a GTX 970, with CUDA 7.5 and OpenCL 1.2.
Other kernels I’ve tested also run slower on CUDA. This is the simplest code that I’ve tested.
Am I doing something wrong, or does the problem lie elsewhere? Does anyone know why I get such results?