Slow OpenCL/OpenGL context switch: time wasted every frame when context switching

Hello.
I wrote an application that uses 6 3D viewports (6 different views of the world) and executes tens of different OpenCL kernels on the textured surfaces it creates. Fortunately I manage to keep it at 60 Hz, but when I started profiling (using the shrDeltaT timing functions from the NVIDIA SDK) I found that most of the time is spent in the first kernel (which actually just copies input to output, and thus does nothing…), while practically no time is spent in the other tens of kernels, which are fairly complex and perform heavy image-processing work. I therefore assume (and hope I'm right) that the cost here is the context switch between OpenGL and OpenCL. Is that so? Is there a way to avoid it?
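Roughly, the per-kernel measurement I have in mind looks like the sketch below. It's only a minimal sketch, not my real code: queue, firstKernel, clTexture, globalSize and localSize are placeholder names, and the idea is just to drain the GL work and the interop acquire before the timer starts so the first kernel isn't charged for them.

#include <GL/gl.h>
#include <CL/cl.h>
#include <CL/cl_gl.h>
#include <shrUtils.h>   /* shrDeltaT() timer from the NVIDIA GPU Computing SDK */

/* Time one kernel in isolation: let GL finish with the shared texture,
   acquire it for CL, drain the queue, and only then start the timer.
   All handles are assumed to be valid and created elsewhere. */
double time_first_kernel(cl_command_queue queue, cl_kernel firstKernel,
                         cl_mem clTexture,
                         const size_t *globalSize, const size_t *localSize)
{
    glFinish();                                              /* GL done with the shared texture */
    clEnqueueAcquireGLObjects(queue, 1, &clTexture, 0, NULL, NULL);
    clFinish(queue);                                         /* drain pending interop work      */

    shrDeltaT(0);                                            /* reset SDK timer, counter 0      */
    clEnqueueNDRangeKernel(queue, firstKernel, 2, NULL,
                           globalSize, localSize, 0, NULL, NULL);
    clFinish(queue);                                         /* wait for the kernel itself      */
    double seconds = shrDeltaT(0);                           /* CPU wall-clock time, kernel only */

    clEnqueueReleaseGLObjects(queue, 1, &clTexture, 0, NULL, NULL);
    clFinish(queue);
    return seconds;
}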

I also tried the NVIDIA OpenCL profiler to try to understand the nature of the problem, but it showed times in microseconds that don't resemble any of the kernel times I measured. Could it be that NVIDIA mixed up "microseconds" with "nanoseconds"? The context switches, by the way, show up in the profiler as consuming 3.62% of the GPU, so how come I see this delay when measuring on the CPU? Does the driver's return ring buffer suffer from some latency I can't see? (Which would explain the more accurate profiler results…)
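To cross-check the profiler numbers against the GPU itself, I could also read the event timestamps directly; the OpenCL event profiling API reports them in nanoseconds. A minimal sketch (the queue has to be created with CL_QUEUE_PROFILING_ENABLE; kernel_time_ms is just an illustrative helper, not code from my application):

#include <CL/cl.h>

/* Device-side timing of a single launch via event timestamps.
   Returns GPU execution time in milliseconds, excluding queueing
   and GL/CL interop cost. */
double kernel_time_ms(cl_command_queue queue, cl_kernel kernel,
                      const size_t *global, const size_t *local)
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    return (double)(end - start) * 1e-6;   /* ns -> ms */
}

If the numbers this returns match the profiler but not my CPU-side measurements, that would point at the GL/CL hand-off (or some driver latency) rather than the kernel itself.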

I also took the liberty of running the oclVolumeRender sample, which runs at 60 Hz on my 470 card, and measured (again…) 16 milliseconds for a kernel for which the profiler gives a totally different result.

Any ideas are welcome.
Eyal.