Kernels get killed: CL_OUT_OF_RESOURCES error waiting for idle


Does anyone know how to fix this problem? For certain kinds of (otherwise error-free) kernels that run in a long loop, like this one:

for (int i = 0; i < N; i++) {
    // some writes
    // memory barrier
    // some reads
}

for large enough N and a large enough run size, the kernel is killed, which results in CL_INVALID_COMMAND_QUEUE errors from subsequent calls. Sometimes (!) pfn_notify, the error callback passed to clCreateContext, also receives the following message: “CL_OUT_OF_RESOURCES error waiting for idle on GeForce GTX 580.” The same thing happens when there are many atomic accesses in long loops.

Do you know what causes this? My guess is that some timer expires because the driver believes the threads are stuck in an infinite loop, and it kills the kernel. Notably, the run size has to be wide enough for this to happen, which suggests that this is caused by an impatient scheduler.
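If the timeout guess is right, one workaround worth trying is to split the long loop across several shorter kernel launches, so that no single launch runs long enough to trip the timer. Below is a minimal host-side sketch in C of that idea, not a confirmed fix. The names `launch_fn`, `run_in_chunks`, `launch_stats` and `count_launch` are hypothetical. In real code the callback would wrap clSetKernelArg, clEnqueueNDRangeKernel and clFinish, and this only works if the kernel keeps all state that must survive between iterations in global memory.

```c
#include <stddef.h>

/* Hypothetical callback: launches loop iterations [begin, end) and blocks
 * until they finish. In real host code it would pass begin/end to the kernel
 * with clSetKernelArg, enqueue it, and return the result of clFinish.
 * Returns 0 on success, non-zero on error. */
typedef int (*launch_fn)(size_t begin, size_t end, void *user);

/* Run iterations [0, n) as a series of launches of at most `chunk`
 * iterations each. Returns 0 if every launch succeeded, -1 if chunk is 0,
 * otherwise the first non-zero error reported by the callback. */
int run_in_chunks(size_t n, size_t chunk, launch_fn launch, void *user)
{
    if (chunk == 0)
        return -1;

    size_t begin = 0;
    while (begin < n) {
        /* Clamp the last chunk to n without overflowing begin + chunk. */
        size_t end = (n - begin < chunk) ? n : begin + chunk;
        int err = launch(begin, end, user);
        if (err != 0)
            return err;   /* stop at the first failed launch */
        begin = end;
    }
    return 0;
}

/* Stand-in callback for illustration only: it counts launches and
 * iterations instead of talking to a device. */
struct launch_stats {
    size_t launches;
    size_t iterations;
};

int count_launch(size_t begin, size_t end, void *user)
{
    struct launch_stats *stats = user;
    stats->launches++;
    stats->iterations += end - begin;
    return 0;
}
```

With n = 10 and chunk = 4 the stand-in callback sees three launches covering [0, 4), [4, 8) and [8, 10). Because each launch is followed by a blocking wait, a killed kernel would show up as a non-zero return code for a specific chunk rather than only as CL_INVALID_COMMAND_QUEUE on some later call. Picking a chunk size is trial and error, since the timeout period itself is unknown here.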

I’m not really asking NVIDIA OpenCL developers to solve the halting problem, but a more reasonable timeout period, at least a few seconds, would let more complicated kernels run.

This is also extremely frustrating: I don’t know how to predict when this will happen, and 19 out of 20 times my code doesn’t even get the pfn_notify callback, so most of the time I effectively have no way of knowing what happened. Is there a parameter I can set to control this? Does anybody have any insight? Thanks.

P.S. BTW this is running the “OpenCL 1.1 CUDA 4.0.1” drivers

I am seeing the same problem with the 280.13 drivers. No such error with the 270.41.19 drivers.