Is OpenGL interoperability slow? Especially cudaGLUnregisterBufferObject()

Hi all,

I would like to visualize some data computed with CUDA, using the OpenGL interoperability calls to get the data into a pixel buffer object (PBO).
This is demonstrated in the “Post-Process in OpenGL” example, which uses PBOs of size 512x512.
However, I noticed that performance drops dramatically when I increase the size of a PBO to, say, 2048x2048.

In that case the execution time of cudaGLUnregisterBufferObject() jumps from 0.5 to 20 milliseconds, and further to, for example, 80 milliseconds with a buffer size of 8192x8192.
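
To put those sizes in perspective, here is a rough back-of-the-envelope sketch in C++. It assumes 4 bytes per pixel (RGBA8), which is my assumption and not something taken from the example, and the helper names `pbo_bytes` and `implied_bytes_per_second` are made up for illustration. It only computes how large each PBO would be and what throughput the timings above would imply if the whole buffer were being moved during the call, which I have not verified.

```cpp
#include <cstdint>

// Size in bytes of a width x height pixel buffer.
// bytes_per_pixel defaults to 4 (RGBA8); this is an assumption, not a
// value taken from the "Post-Process in OpenGL" example.
constexpr std::uint64_t pbo_bytes(std::uint64_t width,
                                  std::uint64_t height,
                                  std::uint64_t bytes_per_pixel = 4) {
    return width * height * bytes_per_pixel;
}

// Throughput (bytes per second) implied if the entire buffer were moved
// during a call that takes `milliseconds`. Whether the call really moves
// the whole buffer is a hypothesis, not a measured fact.
constexpr double implied_bytes_per_second(std::uint64_t bytes,
                                          double milliseconds) {
    return static_cast<double>(bytes) / (milliseconds / 1000.0);
}
```

Under the 4-bytes-per-pixel assumption, `pbo_bytes(512, 512)` is 1 MiB, `pbo_bytes(2048, 2048)` is 16 MiB and `pbo_bytes(8192, 8192)` is 256 MiB. The 20 ms measured at 2048x2048 would then correspond to roughly 0.84 GB/s, and the 80 ms at 8192x8192 to roughly 3.4 GB/s.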

The odd thing is that this drives the frame rate of my programs down towards “unusable” before I even increase the grid size.

I would like to know whether there is a faster way to get large chunks of CUDA results into textures.