Context Switch between CUDA and OpenGL

I have an algorithm that requires heavy interaction between CUDA and OpenGL:

    for(int i = 0; i < 500; i++)
    {
        launchCUDAKernel();
        renderResultsWithOpenGL();  // use the data created by the CUDA kernel
    }

My CUDA kernel is really cheap, and I noticed that I am actually context-switch-bound!
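In case it clarifies where the switches happen: each iteration maps the shared buffer into CUDA, runs the kernel, and unmaps it again before OpenGL can draw. Below is a sketch of that per-iteration pattern, assuming the standard CUDA graphics interop API; the kernel name, buffer names, and launch configuration are placeholders, not my actual code:

```cuda
#include <cuda_gl_interop.h>

GLuint vbo;                          // OpenGL buffer the kernel writes into (placeholder)
cudaGraphicsResource* cudaVboRes;    // CUDA handle for that buffer

// One-time registration, done once outside the loop:
//   cudaGraphicsGLRegisterBuffer(&cudaVboRes, vbo,
//                                cudaGraphicsRegisterFlagsWriteDiscard);

__global__ void myKernel(float* out);  // placeholder kernel

void launchCUDAKernel()
{
    float* devPtr = nullptr;
    size_t  numBytes = 0;

    // Map: hands the buffer over to CUDA -- this boundary is one of
    // the per-iteration GL -> CUDA transitions.
    cudaGraphicsMapResources(1, &cudaVboRes, 0);
    cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &numBytes, cudaVboRes);

    dim3 grid(64), block(256);       // placeholder launch configuration
    myKernel<<<grid, block>>>(devPtr);

    // Unmap: hands the buffer back to OpenGL (CUDA -> GL transition),
    // so renderResultsWithOpenGL() can read it.
    cudaGraphicsUnmapResources(1, &cudaVboRes, 0);
}
```

So every iteration pays for a map/unmap pair, and with a cheap kernel that overhead dominates.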

Unfortunately, batching is not an option for me, so I cannot transform the loop into:

    for(int i = 0; i < 500; i++)
    {
        launchCUDAKernel();
    }

    for(int i = 0; i < 500; i++)
    {
        renderResultsWithOpenGL();  // use the data created by the CUDA kernel
    }

I have heard that with DirectCompute, context switches are cheaper. Is that true, and why is that the case? Are there any performance caveats? It is important for me to know whether it is worth switching to DirectCompute, as that would entail a lot of work.
