First, I apologize if my terminology is off; I am trying to be descriptive.
I have modified the postProcessGL SDK example so that I do not directly display the processed results. I read the back buffer into a pixel buffer object (PBO) with glReadPixels, map the PBO into CUDA, perform some arithmetic on the data, and store the result elsewhere. The PBO data itself is left unchanged by the arithmetic.
I then display the PBO contents, which leads to some slowdown when the CUDA code is active, just as with the pristine postProcessGL example. However, I have found that if I populate a second PBO in the same manner as the first (via glReadPixels), exclude it from the CUDA portion of the code (i.e., it is never mapped), and display this second, untouched PBO instead, then the teapot animation is as smooth as when the CUDA code is disabled entirely (by pressing the space bar).
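For clarity, here is a rough sketch of my display loop (simplified; buffer and kernel names are placeholders, error checking is omitted, and I am showing the cudaGraphics* interop calls rather than whatever mapping API a given SDK version of the sample uses):

```cuda
GLuint pboA, pboB;                       // two PBOs, filled identically
struct cudaGraphicsResource *pboA_res;   // only pboA is registered with CUDA

void display(void) {
    // 1. Render the teapot to the back buffer as usual.
    renderScene();

    // 2. Read the back buffer into BOTH PBOs the same way.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pboA);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pboB);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    // 3. Map only pboA into CUDA; the kernel reads the pixels and writes
    //    its result to a separate device buffer, leaving the PBO unchanged.
    unsigned char *d_pixels;
    size_t num_bytes;
    cudaGraphicsMapResources(1, &pboA_res, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&d_pixels, &num_bytes,
                                         pboA_res);
    myArithmeticKernel<<<grid, block>>>(d_pixels, d_result, width, height);
    cudaGraphicsUnmapResources(1, &pboA_res, 0);

    // 4. Display pboB, the PBO CUDA never touched: the animation is smooth.
    //    Displaying pboA here instead shows the slowdown.
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pboB);
    glDrawPixels(width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}
```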
My question is: does nvcc (or the CUDA runtime) analyze the code for data dependencies and parallelize or overlap operations when it can? What else could explain the animation being smoother even though I am introducing a second glReadPixels of the entire window?