Problem with vector addition example: the program doesn't work the way described

Dear all, I just started with OpenCL and I encountered some weird behavior with the vector addition example. The source for the kernel that should implement vector addition looks like this:

const char *source = "__kernel void vectorAdd(__global const float* a, __global const float* b, \
        __global float* c)              \n\
{                                       \n\
    int n = get_global_id(0);           \n\
    c[n] = a[n] + b[n];                 \n\
}";

But when I run the program with this code, I get an “Invalid command queue” error when I try to read back the vector c through the clEnqueueReadBuffer() call. Some posts on the Internet suggest that this error appears when the kernel execution failed at runtime for some reason (for example, by accessing an element outside the bounds of an array).

So I modified the example kernel to:

const char *source = "__kernel void vectorAdd(__global const float* a, __global const float* b, \
        __global float* c)              \n\
{                                       \n\
    int n = get_global_id(0) % 96;      \n\
    c[n] = a[n] + b[n];                 \n\
}";

96 is the value of the global_work_size argument of clEnqueueNDRangeKernel():

status = clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, NULL, (const size_t*) &cnDimension, 0, 0, 0, 0);

// cnDimension is initialized as 96

cnDimension is also the number of elements in each of the arrays a, b, and c.
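One thing that may be worth checking in the call above is the cast (const size_t*) &cnDimension. clEnqueueNDRangeKernel() takes global_work_size as a const size_t *, and the declaration of cnDimension is not shown here. If it is an int or unsigned int (an assumption, though the cast hints at it), then on a platform where size_t is 8 bytes the runtime would read past the variable and could see a global size much larger than 96. The sketch below is plain C and uses a hypothetical stand-in, fake_enqueue, in place of the real call so that it runs without an OpenCL device. It only illustrates copying the count into a real size_t before passing its address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for clEnqueueNDRangeKernel: like the real call,
   it receives the global work size through a `const size_t *`. */
static size_t fake_enqueue(const size_t *global_work_size)
{
    assert(global_work_size != NULL);
    return *global_work_size;   /* sizeof(size_t) bytes are read here */
}

/* Assumption: the element count is stored as unsigned int. Copy it into
   a size_t and pass that address, instead of casting the address of the
   narrower variable. */
static size_t launch_with_count(unsigned int count)
{
    size_t global_work_size = count;   /* widening conversion keeps the value */
    return fake_enqueue(&global_work_size);
}
```

With the real API the same idea would read clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, NULL, &global_work_size, NULL, 0, NULL, NULL), with no cast needed.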

Now the kernel works as expected: it adds the two vectors a and b and stores the result in c.

To me it seems that the framework actually starts more threads for this kernel than I asked for, so some of them have global_id >= cnDimension, which causes the problem (presumably an out-of-bounds access).
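If extra work-items really are launched, a common alternative to wrapping the index with % 96 is to pass the element count to the kernel and skip any id at or beyond it. The sketch below simulates that in plain C so it can run without an OpenCL device: vector_add_item is a hypothetical stand-in for one work-item, n plays the role of get_global_id(0), and the count parameter is an assumption that is not part of the original example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one work-item of vectorAdd. Ids at or beyond
   `count` do nothing, so they never touch memory outside the arrays. */
static void vector_add_item(size_t n, const float *a, const float *b,
                            float *c, size_t count)
{
    if (n < count) {
        c[n] = a[n] + b[n];
    }
}

/* Simulates launching `global_size` work-items, which may exceed `count`. */
static void vector_add_simulated(const float *a, const float *b, float *c,
                                 size_t count, size_t global_size)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t n = 0; n < global_size; ++n) {
        vector_add_item(n, a, b, c, count);
    }
}
```

In the kernel itself the same guard would be an if (n < count) around the assignment, with count passed as an extra kernel argument. Unlike the modulo version, surplus work-items then do no work instead of recomputing elements that other work-items already handle.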

So my question is: am I doing something wrong, or does the framework really start more threads for this kernel than the global work size I passed?