The good news is that GF100 is very fast, and much attention has clearly been paid to kernel execution. The bad news is that kernel execution time is now just peach fuzz compared with the time spent releasing an event.
I have known about this for about 2 weeks and have tried everything I could think of to get around it, including switching to Linux and calling clReleaseEvent asynchronously in its own thread (separate from Java's garbage collector). I failed. What is weird is that the cost is not a fixed overhead; it seems to increase with kernel execution time.
I use Java, and the NetBeans IDE has a CPU profiling feature. I profiled a standalone "exec" that unit tests, and determines the optimal group size for, one of my kernel components, which finds the median of a set of numbers (2000 numbers for this test). In actual use this kernel is called from within another kernel, so the percentages in this output are a little extreme, but I still see the larger system kernels spending about 50%-60% of the overall time just releasing the event.
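For anyone who wants to cross-check the profiler numbers without a profiler attached, here is a minimal sketch of a timing harness. It is not the code I profiled. The class name `ReleaseShare` and the two `Runnable` arguments are hypothetical placeholders: the first would wrap whatever your Java binding uses to enqueue the kernel and wait for it, the second would wrap the clReleaseEvent call.

```java
class ReleaseShare {
    // Wall-clock time of one call, in nanoseconds.
    static long timeNanos(Runnable action) {
        long start = System.nanoTime();
        action.run();
        return System.nanoTime() - start;
    }

    // Fraction of the combined time that went to the release step (0.0 to 1.0).
    static double releaseShare(long runNanos, long releaseNanos) {
        long total = runNanos + releaseNanos;
        return total == 0 ? 0.0 : (double) releaseNanos / total;
    }

    // Runs both steps `iterations` times and returns the release share
    // of the summed times. Both Runnables are placeholders (assumptions):
    // runAndWait = enqueue the kernel and block until it finishes,
    // releaseEvent = release the event returned by that enqueue.
    static double measure(Runnable runAndWait, Runnable releaseEvent, int iterations) {
        long runTotal = 0;
        long releaseTotal = 0;
        for (int i = 0; i < iterations; i++) {
            runTotal += timeNanos(runAndWait);
            releaseTotal += timeNanos(releaseEvent);
        }
        return releaseShare(runTotal, releaseTotal);
    }
}
```

Calling `ReleaseShare.measure(...)` with a few hundred iterations should give a share that can be compared against the profiler's percentage. Repeating it with kernels of different run lengths is one way to check whether the release cost really grows with kernel execution time rather than staying fixed.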
I have included screenshots of the profiler output for both Win7 and Ubuntu 10.04, taken on the same Quad-Core Extreme machine with a GTX 480. Driver versions are 197.41 and 256.22, respectively. The Linux system is a little faster, but the relationship is the same.
This appears to me to be well worth looking into. EDIT: Win7 is the one on the left.