I am having trouble launching multiple work-items per work-group in OpenCL (threads per block, in CUDA terms)

Every time I try, clEnqueueNDRangeKernel returns an error code instead of CL_SUCCESS, and I do not get the right output in the end.

I have worked with CUDA before, and I am familiar with how to set up block sizes and grid sizes. Here is my relevant OpenCL code:

	size_t localWorkSize[1] = {128}; // one-dimensional range

	size_t globalWorkSize[1] = {(SIZE/localWorkSize[0] + 1)}; // one-dimensional range

	resultCL = clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLConvolution, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);

	if (resultCL != CL_SUCCESS)
	{
		throw(std::string("CallKernel()::Error: Enqueueing kernel onto command queue. (clEnqueueNDRangeKernel)"));
	}

It works if I change my globalWorkSize to just SIZE and pass NULL for localWorkSize, but it is predictably slow. Is this a common problem? Am I forgetting something or making a mistake?

Also, below is my kernel:

	__kernel void ConvolutionGPU(__global float* outputSignalArray,
	                             __global float* inputSignalArray,
	                             __global float* responseSignalArray,
	                             __global int* length)
	{
		unsigned int n = get_global_id(0);
		if (n < *length)
		{
			float accumulator = 0.0f;
			for (int j = 0; j < *length; j++)
			{
				accumulator += inputSignalArray[j] * responseSignalArray[(j + n) % (*length)];
			}
			outputSignalArray[n] = accumulator;
		}
	}

The local work size must evenly divide the global work size; otherwise clEnqueueNDRangeKernel fails with CL_INVALID_WORK_GROUP_SIZE. Also note that, unlike CUDA's grid size, OpenCL's globalWorkSize is the *total* number of work-items, not the number of work-groups, so `SIZE/localWorkSize[0] + 1` is wrong on both counts. Round SIZE up to the next multiple of localWorkSize[0] instead.
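A minimal sketch of the rounding, reusing the variable names from your question (the `RoundUp` helper is my own naming, not an OpenCL API):

```cpp
#include <cstddef>

// Round `value` up to the nearest multiple of `multiple` (assumes multiple > 0).
static size_t RoundUp(size_t value, size_t multiple)
{
    size_t remainder = value % multiple;
    return remainder == 0 ? value : value + multiple - remainder;
}

// Hypothetical usage with your existing handles:
// size_t localWorkSize[1]  = {128};
// size_t globalWorkSize[1] = {RoundUp(SIZE, localWorkSize[0])};
// resultCL = clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLConvolution, 1,
//                                   NULL, globalWorkSize, localWorkSize,
//                                   0, NULL, NULL);
```

Your kernel's `if (n < *length)` guard already makes the padded work-items at the end harmless, so no kernel change is needed.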

Thanks, I must have missed that detail when reading about OpenCL.