clReleaseEvent is the new bottleneck

The good news is that GF100 is very fast, and much attention has been paid to kernel execution. The bad news is that kernel execution time is now just peach fuzz compared with the time spent releasing an event.

I have known about this for about two weeks and have tried everything to get around it, including trying Linux and calling clReleaseEvent asynchronously in its own thread (separate from Java’s garbage collector). I failed. What is weird is that it is not a fixed overhead; it seems to increase with kernel execution time.

I use Java. The NetBeans IDE has a CPU profiling feature. I measured a standalone “exec” that unit-tests and determines the optimal group size of one of my kernel components, which finds the median of a set of numbers (2000 numbers for this test). This kernel is called from within another kernel in actual use, so the percentages are a little extreme in this output, but I still see that the larger system kernels spend about 50%-60% of the overall time just releasing the event.

I have included screenshots of the profiler output for both Win7 and Ubuntu 10.04 from the same Quad-Core Extreme machine with a GTX 480. Driver versions are 197.41 and 256.22, respectively. The Linux system is a little faster, but the relationship is the same.

This appears to me to be well worth looking into. EDIT: Win7 is the one on the left.


I had a short look at the JavaCL source code, and did not find any obviously “wrong” synchronizations or wait conditions. The OpenCL specification says:
“The event object is deleted once the reference count becomes zero, the specific command identified by this event has completed (or terminated) and there are no commands in the command-queues of a context that require a wait for this event to complete.”
I just wonder whether the function takes so long because it is (for whatever reason, at whatever point) waiting for the actual kernel to finish…? This would at least explain your statement that…
“it is not fixed overhead, but seems to increase with kernel execution time”

Did you measure the actual execution time for the kernel, by enabling profiling and computing the time between CL_PROFILING_COMMAND_QUEUED or CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END?
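To make the arithmetic concrete, here is a minimal sketch of the time computation I mean. It assumes the timestamps have already been read from the event (for example with clGetEventProfilingInfo) and are in nanoseconds, as the OpenCL specification describes them. The class name ProfilingTimes, the method elapsedMillis and the sample values are made up for illustration; they are not part of JavaCL or of your code.

```java
// Sketch: turn OpenCL profiling timestamps into durations.
// The timestamps are assumed to have been read already; this class does
// not call OpenCL itself. ProfilingTimes is a hypothetical helper name.
final class ProfilingTimes {

    /** Elapsed time between two device timestamps (nanoseconds), in milliseconds. */
    static double elapsedMillis(long fromNanos, long toNanos) {
        // The counters are unsigned 64-bit in OpenCL; the difference of two
        // nearby values is still correct with Java's signed long.
        return (toNanos - fromNanos) / 1.0e6;
    }

    public static void main(String[] args) {
        // Made-up sample values, not measurements from this thread.
        long queued = 1_000_000L; // CL_PROFILING_COMMAND_QUEUED
        long start  = 1_400_000L; // CL_PROFILING_COMMAND_START
        long end    = 3_900_000L; // CL_PROFILING_COMMAND_END

        System.out.println("queued -> end: " + elapsedMillis(queued, end) + " ms");
        System.out.println("start  -> end: " + elapsedMillis(start, end) + " ms");
    }
}
```

Comparing the start-to-end value with what the NetBeans profiler attributes to clReleaseEvent should show whether the release call is simply absorbing the kernel's run time.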



Thanks for looking into this. I did try device profiling, which required that I do a finish instead of a waitForEvent. If I did not, I got a CL_PROFILING_INFO_NOT_AVAILABLE. The profiling data returned showed a much larger time than the value shown in the original post for enqueueWaitForEvent.

I went back and switched to the CLEvent.waitFor() method. Releasing the event after that was very fast.