CL_INVALID_COMMAND_QUEUE, possible Nvidia bug? Writing to a single __local uint variable causes erro

(Update: See next post)

I’ve been hunting desperately something which I first assumed to be a bug in my code, but which now seems to possibly be one in Nvidia’s driver (or somewhere else).

What my kernel is basically doing is to perform a certain image analysis of static images (part of a eye recognition software). There are different (test) images, and there’re parameters that can be changed which can create different results, but the error occurs without changing either of these (read: doing the very same thing multiple times fails at some point).

After executing the kernel via clEnqueueNDRangeKernel, the next call to a OpenCL function fails with a CL_INVALID_COMMAND_QUEUE error. What causes the error is a write operation to a single uint in local memory by a single thread. The error is reproducible more or less, that is, given the conditions for its occurance, it will occur within a few iterations (sometimes a few hundred, but most of the time only tens of them). The problem is that the input data does not change (to be more correct: the error occurs more or less independently from the input data; it does not occur with certain input data, but will very sure show up for other cases).

I was able to find out the exact operation that causes the error; without it or done in another way, there seems to be no error. See the code:

// helper function, see it called from below

void oneUintToTwoUint(uint n, __local uint* ui1, __local uint* ui2)


	*ui1 = n & 0xFFFF;

	*ui2 = (n >> 16) & 0xFFFF; // um, I might very well leave out the & 0xFFFF part here, right? just noticed it...;)


uint candRingTest(


	__global __read_only uint* uiRingInfoBuf, 




	uint uiLocalId = get_local_id(0);


	__local uint uiRingStartIdx, uiLength;


	// I'm letting the first work-item of the group do the job in order to avoid that 64 threads 

	// (which is the work-group size) read and write the same variables. Did it like that a few 

	// times in other kernels before

	if (uiLocalId == 0)


		ringData1.bEvaluated = false;

		// the following takes a uint and extracts two numbers from the upper and lower 16 bits

		oneUintToTwoUint(uiRingInfoBuf[0], &uiRingStartIdx, &uiLength); // here it works




	// same operation as above, but split up to analyse

	if (uiLocalId == 0)


		ringData2.bEvaluated = false;

		uint start, length,  val;

		val = uiRingInfoBuf[1];

		oneUintToTwoUint(val, &start, &length);

		uiRingStartIdx = start;

		uiLength = length; // <-- this is the operation that causes the error; if commented, there's no error


	barrier(CLK_LOCAL_MEM_FENCE); // whether or not this line is present has no influence

	return 0; // stop kernel execution here while bug hunting; there's more code after this actually


Like I said before, if I change the code to work differently, no error occurs. For example, if I remove the (if uiLocalId == 0) statement, change uiRingStartIdx and uiLength to default (private) name space, and thus let each thread write its own variables, it works.

There are no changes of any input data, global, local, or whatever buffers between the iterations.

If you need further information about something, please let me know. I’d also be happy to hear that this is (probably) not my fault, even though you might not be able to give any helpful hints. ;)

Thanks in advance. :)

I’ve been told that the graphics card in charge has had problems with OpenGL occasionally in the past (including system freezes). I’ll check on another machine tomorrow to see if that’s the cause of my problem.


FYI, I’ve tried it on two other computers, resulting in two other different crashes (in other kernels…). I will investigate these first and report back then…

Sorry for the delay, but I was too busy the last days.

I have found out what goes wrong with my kernel. It is passed a uint2 value containing the dimensions of the image which is being processed. Strangely, what the kernel receives is bogus values, which leads to array accesses way out of bounds. So this clearly seems to be an Nvidia driver or compiler bug…

I still find it a very strange bug since, as I wrote above, it does not always fail. In fact, the first computer with a 9800 GTX+ (under Windows XP) behaves as described above, while another one with a GTX 285 (or 295, or something else, doesn’t matter anyway) (under Windows 7) not only crashes immediately and every time the kernel is executed, but also in another kernel which is executed previously to the one I spoke about before. This other kernel also contains a uint2 with image dimensions.
What’s also strange is that nobody else seems to have the problem. I guess uint2 is not that unusual to use as a kernel argument, so one would expect the bug to surface very often…

Anyway, to sum it all up:

  • one card breaks uint2 sometimes
  • another card breaks uint2 every time
  • the issues could be resolved by submitting two uints instead one one uint2

Unfortunately, there remain other strange issues with values that contain bogus data without any thinkable reason. I might ask the forum about this later, for now I will try to find out more about it.

Thanks for your time, and hopefully this will help somebody with similar problems.