I think this is due to the vertical sync option being turned on, or left to the application's choice, which may mean sync is on by default. Vertical sync forces frames to be presented in step with your display's refresh; this avoids image tearing, but it limits the graphics card to a single frame per monitor refresh. The refresh rate is usually 60Hz, so in one second, that is 1000 ms, you can have at most 60 frames, and a single frame takes 1000/60 ≈ 16.7 ms.
I must add here that the buffer swap itself is usually fast if your buffer object isn't huge. In fact, if you're using CUDA interoperability with OpenGL, you are presumably mapping and unmapping the buffer object and copying data into it from a CUDA array (or wherever it lives), and that takes more time than the swap. Vertical sync is apparently applied at the swap stage, making the swap wait until the right time is reached.
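As a rough sketch of what that map/copy/unmap stage looks like with the CUDA runtime API (the resource handle, source pointer, and size names here are my own placeholders, not from the question; error checking is omitted):

```cuda
// Hypothetical per-frame update of an OpenGL PBO from device memory.
// `resource` is assumed to have been registered beforehand with
// cudaGraphicsGLRegisterBuffer(); `d_src` and `nbytes` are placeholders.
void update_pbo(cudaGraphicsResource_t resource,
                const void *d_src, size_t nbytes)
{
    void  *d_dst = nullptr;
    size_t mapped_bytes = 0;

    // Map: makes the GL buffer visible to CUDA; GL must not touch it now.
    cudaGraphicsMapResources(1, &resource, 0);
    cudaGraphicsResourceGetMappedPointer(&d_dst, &mapped_bytes, resource);

    // The copy (or a kernel writing into d_dst) happens while mapped.
    cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDeviceToDevice);

    // Unmap: hands the buffer back to OpenGL before the draw and swap.
    cudaGraphicsUnmapResources(1, &resource, 0);
}
```

The swap itself then happens outside this function, in the normal OpenGL loop.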
The cost to worry about, perhaps, is the cost of mapping and unmapping plus the actual copy operation. I do not know how asynchronous the copy can be made, whether with cudaMemcpy or with a kernel launched on another stream, because it has to be "wrapped" between the mapping and unmapping phases, and unmapping requires the pending operations to be finished. I have yet to experiment with reordering these calls, so I don't know if overlap is possible. One thing you can do is skip frames: guard the entire "map, copy, unmap" stage with a condition that passes only every other iteration of the OpenGL loop. You would cut the cost in half at a small loss of continuity, which shouldn't be too bad in general.
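The frame-skipping idea can be sketched as a simple counter check; the function names and the update period of 2 are illustrative assumptions, not part of the original answer:

```cpp
#include <vector>

// Run the expensive map/copy/unmap stage only on every other
// iteration of the render loop; swap buffers every frame regardless.
bool should_update(unsigned long frame, unsigned period = 2)
{
    return frame % period == 0; // true on frames 0, 2, 4, ...
}

// Build the update plan for the first `frames` loop iterations.
std::vector<bool> schedule(unsigned long frames)
{
    std::vector<bool> plan;
    for (unsigned long f = 0; f < frames; ++f)
        plan.push_back(should_update(f));
    return plan;
}
```

In the real loop you would call `should_update(frame)` around the map/copy/unmap block, while the draw call and buffer swap still run unconditionally.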
You can turn vertical sync off in the Nvidia control panel, somewhere near the bottom of the 3D settings list. I can't say exactly where, as I'm not currently on an Nvidia machine.