cuda 3: cudaGraphicsMapResources performance issue (linux 32bit, driver 195.30, macbookpro nvidia 9600M)

hi cuda users

i have a deferred shading setup with the following steps:

  1. render (GLSL) to FBO with multiple renderbuffers attached

  2. transfer some of the renderbuffers via PBO/TBO and cudaGraphicsMapResources to CUDA

  3. process pixels with CUDA

  4. transfer back to texture using PBO, also map the non-cuda-processed pbo to texture

  5. render result (GLSL) using TBOs and usual textures

my problem is that cudaGraphicsMapResources in step 2 takes 20ms (macbookpro nvidia 9600M) to map 4 PBOs (2 read/ 2 write) as cuda pointers. i expect this number to be ~1ms or less… is my expectation wrong?

related question: i read in the programming guide for 3.0 that renderbuffers can be mapped directly with cuda (avoiding the pbo) using cudaGraphicsGLRegisterBuffer, but i cannot get it to work. does anybody have some example code for that (RBO -> cuda texture)?
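as far as the runtime API reference goes, renderbuffers and textures are registered with cudaGraphicsGLRegisterImage, while cudaGraphicsGLRegisterBuffer is meant for buffer objects such as PBOs. below is a minimal sketch of the RBO -> cuda path under several assumptions: a current GL context, an RGBA8 renderbuffer, and flag names as they appear in current toolkits (they may differ in 3.0). the helper name and its parameters are hypothetical and not taken from the original code.

```cpp
#include <GL/gl.h>              // GL_RENDERBUFFER may additionally need glext.h or GLEW
#include <cuda_runtime_api.h>
#include <cuda_gl_interop.h>
#include <vector_types.h>       // uchar4

// Hypothetical helper: copies one RGBA8 renderbuffer into linear device memory.
// `rbo`, `width`, `height` and `devDst` are placeholder names for illustration.
cudaError_t copyRenderbufferToDevice(GLuint rbo, int width, int height,
                                     uchar4* devDst)
{
    cudaGraphicsResource_t res = 0;

    // Renderbuffers are registered as images, not as buffers.
    cudaError_t err = cudaGraphicsGLRegisterImage(
        &res, rbo, GL_RENDERBUFFER, cudaGraphicsRegisterFlagsReadOnly);
    if (err != cudaSuccess) return err;

    err = cudaGraphicsMapResources(1, &res, 0);
    if (err == cudaSuccess) {
        // A mapped image is exposed as a CUDA array, not as a raw pointer.
        cudaArray* array = 0;
        err = cudaGraphicsSubResourceGetMappedArray(&array, res, 0, 0);
        if (err == cudaSuccess) {
            const size_t rowBytes = width * sizeof(uchar4);
            err = cudaMemcpy2DFromArray(devDst, rowBytes, array, 0, 0,
                                        rowBytes, height,
                                        cudaMemcpyDeviceToDevice);
        }
        cudaGraphicsUnmapResources(1, &res, 0);
    }

    cudaGraphicsUnregisterResource(res);
    return err;
}
```

the sketch registers and unregisters inside the helper only to stay self-contained; in a render loop the registration would normally happen once at initialisation, with only the map/unmap pair executed per frame.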

note: i use a cpu timer and do a cudaThreadSynchronize before/after each start/stop of the timer. i render at 1024x768 resolution.

here are the relevant code sections, first initialisation:

btw: my code is inspired by this nice blog post:…tyapiwithopengl

and the render loop:

i appreciate any help!

kind regards,


hi again

i investigated some more and also switched back to the old interop API. there i noticed that only the first of my four cudaGLMapBufferObject calls is slow (~19ms); the subsequent calls take about 0.4ms each. i used the following code to measure the time:


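    // record an event on stream 0 just before and just after the map call
    cutilSafeCall( cudaEventRecord( mCudaEventStart, 0 ) );
    cutilSafeCall( cudaGLMapBufferObject( (void**)&mCudaDevStartPixels, mPBO[0] ) );
    cutilSafeCall( cudaEventRecord( mCudaEventStop, 0 ) );
    // wait for the stop event, then read the elapsed time in milliseconds
    cutilSafeCall( cudaEventSynchronize( mCudaEventStop ) );
    float mapMs = 0.0f;  // placeholder name, not from the original code
    cutilSafeCall( cudaEventElapsedTime( &mapMs, mCudaEventStart, mCudaEventStop ) );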


so, has anybody experienced similar behavior?



Maybe the delay is due to OpenGL not having finished its pending work. According to the function documentation, cudaGraphicsMapResources guarantees that any graphics calls issued before it will complete before any subsequent CUDA work begins.

So I suppose it calls glFinish() or something similar internally, and that may be what stalls. Have you tried calling glFinish() just before the call to cudaGraphicsMapResources()? That way you could test whether the stall is due to OpenGL not having finished its operations.



hi paul

you were right: an explicit glFinish/glFlush call revealed that the slowdown is caused by glReadPixels etc. i opened a new topic about what performance to ideally expect:

thanks for your help,