(Update: See next post)
I’ve been hunting desperately something which I first assumed to be a bug in my code, but which now seems to possibly be one in Nvidia’s driver (or somewhere else).
What my kernel is basically doing is to perform a certain image analysis of static images (part of a eye recognition software). There are different (test) images, and there’re parameters that can be changed which can create different results, but the error occurs without changing either of these (read: doing the very same thing multiple times fails at some point).
After executing the kernel via clEnqueueNDRangeKernel, the next call to a OpenCL function fails with a CL_INVALID_COMMAND_QUEUE error. What causes the error is a write operation to a single uint in local memory by a single thread. The error is reproducible more or less, that is, given the conditions for its occurance, it will occur within a few iterations (sometimes a few hundred, but most of the time only tens of them). The problem is that the input data does not change (to be more correct: the error occurs more or less independently from the input data; it does not occur with certain input data, but will very sure show up for other cases).
I was able to find out the exact operation that causes the error; without it or done in another way, there seems to be no error. See the code:
// helper function, see it called from below
void oneUintToTwoUint(uint n, __local uint* ui1, __local uint* ui2)
{
*ui1 = n & 0xFFFF;
*ui2 = (n >> 16) & 0xFFFF; // um, I might very well leave out the & 0xFFFF part here, right? just noticed it...;)
}
uint candRingTest(
...
__global __read_only uint* uiRingInfoBuf,
...
)
{
uint uiLocalId = get_local_id(0);
...
__local uint uiRingStartIdx, uiLength;
...
// I'm letting the first work-item of the group do the job in order to avoid that 64 threads
// (which is the work-group size) read and write the same variables. Did it like that a few
// times in other kernels before
if (uiLocalId == 0)
{
ringData1.bEvaluated = false;
// the following takes a uint and extracts two numbers from the upper and lower 16 bits
oneUintToTwoUint(uiRingInfoBuf[0], &uiRingStartIdx, &uiLength); // here it works
}
barrier(CLK_LOCAL_MEM_FENCE);
...
// same operation as above, but split up to analyse
if (uiLocalId == 0)
{
ringData2.bEvaluated = false;
uint start, length, val;
val = uiRingInfoBuf[1];
oneUintToTwoUint(val, &start, &length);
uiRingStartIdx = start;
uiLength = length; // <-- this is the operation that causes the error; if commented, there's no error
}
barrier(CLK_LOCAL_MEM_FENCE); // whether or not this line is present has no influence
return 0; // stop kernel execution here while bug hunting; there's more code after this actually
}
Like I said before, if I change the code to work differently, no error occurs. For example, if I remove the (if uiLocalId == 0) statement, change uiRingStartIdx and uiLength to default (private) name space, and thus let each thread write its own variables, it works.
There are no changes of any input data, global, local, or whatever buffers between the iterations.
If you need further information about something, please let me know. I’d also be happy to hear that this is (probably) not my fault, even though you might not be able to give any helpful hints. ;)
Thanks in advance. :)