A problem with CUDA & OpenGL interoperation

Hello all!!
I checked the past threads and found some information about CUDA & OpenGL interoperation…
The performance of cudaGLMapBufferObject & cudaGLUnmapBufferObject is poor…

web : http://forums.nvidia.com/index.php?showtop…MapBufferObject

I checked what's new in CUDA 2.1, and the release notes mention "OpenGL interoperability improvements".
So I tested whether it actually improved, as shown below…

Project : SimpleGL
Results :

           CUDA 2.0

           cudaGLMapBufferObject       0.27 ms
           cudaGLUnmapBufferObject    16.5  ms    // <-- too slow

           CUDA 2.1

           cudaGLMapBufferObject      16.1  ms    // <-- too slow
           cudaGLUnmapBufferObject     0.21 ms

The cost has simply swapped between the two calls!! :blink:
But it's still slow overall…

I need CUDA & OpenGL interoperation working to finish my project, and this problem is costing my program a lot of time… >.<

Could anybody (or an NVIDIA engineer) give me a suggestion? Thanks!!

Did you try to measure the timings with and without calling the CUDA kernel? Sometimes execution is deferred, and it's hard to pinpoint the exact place where the delay occurs.

I encountered a similar problem, but in my case while using a frame buffer object (FBO) together with a pixel buffer object (PBO) for post-processing. If the PBO was still bound while the FBO was in use, there was a significant delay; unbinding the PBO before using the FBO fixed the problem. The strange thing was that the delay didn't show up in the FBO part, but while calling glutSwapBuffers().

glFlush() and cudaThreadSynchronize() might also help you get more accurate timings.

I just tested the time without any CUDA kernel function, because I found that cudaGLMapBufferObject & cudaGLUnmapBufferObject themselves cost a lot of time.

The timing method I used is shown below:

unsigned int timer = 0;
float elapsedTimeInMs = 0.0f;
void *dptr = NULL;

CUT_SAFE_CALL( cutCreateTimer( &timer ) );
CUT_SAFE_CALL( cutStartTimer( timer ) );

cudaGLMapBufferObject( &dptr, pbo );   // pbo: the registered PBO handle

CUT_SAFE_CALL( cutStopTimer( timer ) );
elapsedTimeInMs = cutGetAverageTimerValue( timer );
CUT_SAFE_CALL( cutDeleteTimer( timer ) );
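One caveat with the cutil timer is that it measures CPU wall-clock time, so any GPU work still in flight can land on whichever call happens to synchronize. A sketch of GPU-side timing with CUDA events instead (assuming `pbo` is an already-registered buffer object):

cudaEvent_t start, stop;
float ms = 0.0f;
void *dptr = NULL;

cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start, 0 );
cudaGLMapBufferObject( &dptr, pbo );
cudaEventRecord( stop, 0 );

cudaEventSynchronize( stop );              // wait until the stop event completes
cudaEventElapsedTime( &ms, start, stop );  // time between the two events, in ms

cudaEventDestroy( start );
cudaEventDestroy( stop );

Because the elapsed time is taken between two events on the GPU's timeline, deferred work is attributed to the call that actually caused it, not to whichever later call blocked.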

Yes… I also use a PBO & FBO!! :thumbup:

The steps I follow are:

  1. Create PBO

  2. Bind PBO

  3. glDrawBuffer(GL_BACK)

  4. renderScene()

  5. glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));

  6. Unbind PBO (bind buffer 0)

  7. Register & map PBO to CUDA    // <-- cudaGLMapBufferObject costs a lot of time

  8. Read data from the PBO & run the kernel function

  9. Unmap & unregister PBO    // <-- cudaGLUnmapBufferObject costs a lot of time


Three of these steps are my bottleneck: the FBO-to-PBO readback, cudaGLMapBufferObject, and cudaGLUnmapBufferObject.

They take so much time that I can't get good performance…
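One thing worth checking in step 7: registration is much heavier than mapping, so the PBO should be registered once at startup and only mapped/unmapped per frame, never registered/unregistered per frame. A minimal sketch of that split (assuming a GLuint `pbo` already created with glGenBuffers/glBufferData, and a hypothetical kernel launcher `runKernel`):

// --- init, once ---
cudaGLRegisterBufferObject( pbo );     // heavyweight: do this once, not per frame

// --- per frame ---
void *dptr = NULL;
cudaGLMapBufferObject( &dptr, pbo );   // map the PBO for CUDA access
runKernel( dptr, width, height );      // hypothetical: read/process the PBO data
cudaGLUnmapBufferObject( pbo );        // return the buffer to OpenGL

// --- shutdown, once ---
cudaGLUnregisterBufferObject( pbo );

If your step 7 currently calls cudaGLRegisterBufferObject every frame, moving it to init may remove most of the map/unmap cost you are seeing.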


I have the same problem: the bigger the kernel's grid size, the longer glutSwapBuffers() is delayed. With a grid size of 10000 the delay is more than 100 ms, so it is a big problem. I don't know how to solve it; does anyone know?
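Kernel launches are asynchronous, so the kernel's cost often surfaces at the next call that forces a synchronization, which can be glutSwapBuffers(). A sketch (with a hypothetical kernel `myKernel`) that makes the cost show up where it is incurred:

myKernel<<< grid, block >>>( dptr );   // launch returns immediately
cudaThreadSynchronize();               // block here until the kernel finishes
                                       // (CUDA 2.x API; any timing after this
                                       //  point no longer includes the kernel)
glutSwapBuffers();                     // should no longer absorb the kernel time

This does not make the kernel faster, but it tells you whether glutSwapBuffers() is genuinely slow or is just where the implicit synchronization lands.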

Do not use glReadPixels as it is very slow.

Use glTex(Sub)Image2D instead:


(The format/type arguments below are filled in as an example; the original post cut off mid-call, so substitute whatever matches your PBO contents.)

glBindTexture(target, dest_texture);

glTexSubImage2D(GL_TEXTURE_RECTANGLE_NV, 0, 0, 0,
				width >> decimate_output, height >> decimate_output,
				GL_RGB, GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));  // source: bound PBO

glBindTexture(target, 0);


I believe part of the unmapping time you are measuring comes from the glReadPixels command still being in flight.
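If that is the cause, draining the GL pipeline before starting the timer should move the cost out of the map call and make the measurement attributable. A sketch:

glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));
glFinish();                            // block until the readback has completed

// ...start timer here...
cudaGLMapBufferObject( &dptr, pbo );   // now measures only the map itself
// ...stop timer here...

glFinish() is too expensive to leave in production code, but as a diagnostic it separates "the map is slow" from "the map is waiting on glReadPixels".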