Every time I try to run my kernel, clEnqueueNDRangeKernel returns an error code instead of CL_SUCCESS, and the final output is wrong. I have worked with CUDA before, so I am familiar with setting up block sizes and grid sizes. Here is the relevant OpenCL host code:
size_t localWorkSize[1] = {128}; // one-dimensional range
size_t globalWorkSize[1] = {(SIZE/localWorkSize[0] + 1)}; // one-dimensional range
resultCL = clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLConvolution, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
if (resultCL != CL_SUCCESS)
{
throw(std::string("CallKernel()::Error: Enqueueing kernel onto command queue. (clEnqueueNDRangeKernel)"));
}
It works if I set globalWorkSize to SIZE and pass NULL for localWorkSize, but then it is predictably slow. Is this a common problem? Am I forgetting something or making a mistake?
Also, below is my kernel:
"__kernel void ConvolutionGPU(__global float* outputSignalArray, __global float* inputSignalArray,__global float* responseSignalArray, __global int* length)",
"{",
" unsigned int n = get_global_id(0);",
" if(n < *length)",
" { ",
" float accumulator = 0.0f;",
" for(int j = 0; j < *length; j++)",
" {",
" accumulator += inputSignalArray[j] * responseSignalArray[(j+n) % (*length)];",
" }",
" outputSignalArray[n] = accumulator;",
" }",
"}"