Kernel execution fails with error CL_OUT_OF_RESOURCES HELP

Hey there,

I have a problem with my kernel:
Although I am using a smaller local_work_size than the maximum recommended for this kernel (256, queried via clGetKernelWorkGroupInfo) on my device (GeForce 8600M GT), the kernel execution
fails.

2 dimensions, local work size = {8, 16} (also tried {8,24}, {12, 24}).

The function clEnqueueNDRangeKernel returns with error code CL_OUT_OF_RESOURCES. When I
use larger local work sizes (but still less than the device maximum of 512!), the same function
returns CL_INVALID_VALUE instead.

I just cannot figure out what the problem is, and I do not know how to look for it efficiently!
I tried commenting out some parts and, depending on the local_work_size values, it sometimes
“crashes” at the last statement (an assignment to a result array) and sometimes much earlier in the code.

With the small local_work_size values {8, 16}, clEnqueueNDRangeKernel seems to execute as it should,
but then I get CL_OUT_OF_RESOURCES when trying to fetch the results via clEnqueueReadBuffer
(and at that point the screen blinks for a moment!).

As mentioned before, I do not have any clue what to do to figure out how to solve this problem.
Any suggestions?

Thanks in advance!

I figured it out:
The error CL_OUT_OF_RESOURCES happens (in my case) only when the problem size is too large for my device. With small input arrays it works fine; with larger ones it crashes :(

Still, I have a tiny logical error in my kernel.

If anyone has suggestions on how to track it down efficiently,
or knows a way to debug OpenCL kernels, please let me know!

Thanks in advance!

I’ve gotten this error when one of my local memory allocations wasn’t big enough after I forgot to resize it for a new block size. Check your memory allocations.

I think the first thing to try is dropping your work group size by half and seeing what happens. It seems like everyone keeps getting this error, and I just hit it for the first time today. I think one of the bugs might be that the max work group size is calculated incorrectly in some cases.

I needed a kernel that finds the median of a __global float* array, with an average length of about 1000. I was working on a kernel that did exactly that, but I also tried searching for sorting kernels; see the thread [post=“0”]GPU_QuickSort, an OpenCL implementation somewhere?[/post].

The specialized kernel is now done in its first iteration, without an atomic to indicate that a work-item has already found the result. I developed and ran it on a MacBook Pro. The 9400M averaged only 1.28 ms for 1000 numbers, not too bad. When the 9600M did not do any better, I thought I ought to bring it over to CUDA on an 8800 GTX before putting in atomics, and see what that time was.

You guessed it: out of resources. It was taking a really long time to run just one iteration (I measure the average of runs 2 to 101). I thought it was going to hit the watchdog. It might have; the screen blinked. The max work group size was 448 (512 on OS X), and that is what I used. I backed off to 256 and it ran, averaging only 0.75 ms.

For the time being, the answer to how to debug OpenCL might be to develop it on OS X, if you can. It ran the first time over there! That almost never happens to me. I would have been pulling my hair out by now if I had started on CUDA, ripping apart perfectly working code.