cudaGLMapBufferObject (and unmap) performance: these calls take way too long

I found a few other posts about this using the forum search, but no solutions. I’m working on a proof-of-concept for using CUDA as a deinterlacer. This needs to get a frame out every ~16ms, but the program spends most of its time mapping and unmapping the OpenGL pixel buffer objects. For 640x480 video, the average map+unmap time is 8.6ms, and for 1920x1280 video, the time is 35.4ms!

This is a naive bob deinterlacer that uses 9 PBOs: Y, Cb, and Cr planes for the source frame, the even-field destination, and the odd-field destination. The data in every buffer are unsigned chars. The source frame is initialized like this:

GLuint buffers[3];

glGenBuffersARB(3, buffers);

for (i = 0; i < 3; i++) {
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, buffers[i]);
    glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, widths[i]*heights[i], data[i], GL_STREAM_DRAW_ARB);
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
}

for (i = 0; i < 3; i++) {
    error = cudaGLRegisterBufferObject(buffers[i]);
    ...error handling...
}

And the destination frames are initialized the same way, with data being an array of NULL pointers instead of unsigned char pointers.
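In other words, the destination setup is roughly this (a sketch, not pasted from my program; dst_buffers and field_sizes are placeholder names):

GLuint dst_buffers[3];

glGenBuffersARB(3, dst_buffers);

for (i = 0; i < 3; i++) {
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, dst_buffers[i]);
    /* NULL data pointer: allocate storage without an initial upload */
    glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, field_sizes[i], NULL, GL_STREAM_DRAW_ARB);
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
    error = cudaGLRegisterBufferObject(dst_buffers[i]);
}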

When the program is ready to use CUDA to split the fields, it works like this:

uint8_t *f_y, *f_cr, *f_cb;
...more declarations...

cudaGLMapBufferObject((void **)&f_y,  frame->pbo_y);
cudaGLMapBufferObject((void **)&f_cr, frame->pbo_cr);
cudaGLMapBufferObject((void **)&f_cb, frame->pbo_cb);
...more mapping...

...call field splitter kernel (<1ms)...

cudaGLUnmapBufferObject(frame->pbo_y);
cudaGLUnmapBufferObject(frame->pbo_cr);
cudaGLUnmapBufferObject(frame->pbo_cb);
...more unmapping...
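For what it’s worth, the field splitter kernel itself is trivial; per plane it is essentially this (a sketch of the idea, not the exact kernel from my program):

__global__ void split_fields(const unsigned char *src, unsigned char *even,
                             unsigned char *odd, int width, int height)
{
    /* one thread per source pixel */
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    unsigned char v = src[y * width + x];
    if (y & 1)
        odd[(y / 2) * width + x] = v;    /* odd source lines -> odd field */
    else
        even[(y / 2) * width + x] = v;   /* even source lines -> even field */
}

The kernel itself finishes in well under a millisecond; it’s the map/unmap calls around it that eat all the time.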

All of this code works, but far too slowly. Given the timings I’m seeing, I suspect it would actually be faster to copy the results to the host and then back into OpenGL. Am I doing something wrong, is this a bug, or is it something else?

CUDA 1.1 on Linux x86-64, driver version 169.09, 8800 GT hardware.

Could you post the source code with a reproduction case?

Paulius

Here’s the source code. There’s a short 1080i MPEG-2 clip included for testing. The video should display in a GLUT window, and the timing information should be printed to stdout.

I’ve got the same problem.

I’m loading OpenGL texture data (1920x1080) into CUDA using OpenGL buffer objects. When I time the following operation:

// buffer initialization
GLuint bufferObj;
GLsizei bufferSize = 1920 * 1080 * 4;

glGenBuffers(1, &bufferObj);
glBindBuffer(GL_PIXEL_PACK_BUFFER, bufferObj);
glBufferData(GL_PIXEL_PACK_BUFFER, bufferSize, NULL, GL_DYNAMIC_READ);

// perform texture data copy to opengl buffer
cutResetTimer(hTimer);
cutStartTimer(hTimer);
    glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
cutStopTimer(hTimer);

printf("OpenGLBuffer copy time: %f msec\n", cutGetTimerValue(hTimer));

it takes about 10 msec. When I copy the same amount of memory allocated on the host, using cudaMemcpy as follows:

unsigned char * h_Data = (unsigned char *)malloc(bufferSize);

cutResetTimer(hTimer);
cutStartTimer(hTimer);
    CUDA_SAFE_CALL( cudaMemcpy(d_Data, h_Data, bufferSize, cudaMemcpyHostToDevice) );
cutStopTimer(hTimer);

printf("cudaMemcpy: %f msec\n", cutGetTimerValue(hTimer));

it takes only 6 msec!

What am I doing wrong?

I have the same problem: it is quicker for me to copy from OpenGL to host memory and then from host to CUDA, process, and afterwards copy from CUDA to the host and from the host back to OpenGL.
Pixel Buffer Objects are really slow. Will this change in CUDA 2.0?
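For reference, the round trip I mean looks roughly like this (a sketch with placeholder names, not my actual code):

// GL -> host -> CUDA, process, then CUDA -> host -> GL,
// avoiding cudaGLMapBufferObject entirely
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glGetBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, size, host_buf);   // GL buffer -> host
cudaMemcpy(dev_buf, host_buf, size, cudaMemcpyHostToDevice);     // host -> CUDA

process_kernel<<<grid, block>>>(dev_buf);                        // process on the device

cudaMemcpy(host_buf, dev_buf, size, cudaMemcpyDeviceToHost);     // CUDA -> host
glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, size, host_buf);      // host -> GL buffer
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);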

Excuse me for reviving such an old topic, but my applications are still memory bandwidth limited, so this is actually quite important to me. I experience the same problems with getting data into OpenGL. I ran a couple of timing tests that may be of interest.

For every scenario I ran the test 100 times and averaged the run times. For every trial, buffers were first mapped, memory copied, and then the buffers were unmapped again, simulating a per-frame copy. Wherever OpenGL flags were appropriate I used GL_WRITE_ONLY and GL_STREAM_DRAW. The memory buffer size is 128096016 bytes everywhere. The buffers in the results below are:

  • ‘host_malloc’ is host memory allocated by malloc()

  • ‘host_cuda’ is host memory allocated by cudaMallocHost()

  • ‘dev_cuda’ is device memory from cudaMalloc()

  • ‘dev_vbo’ is an OpenGL buffer created with STREAM_DRAW, mapped once with WRITE_ONLY (I once read CUDA inherits this flag, so just to be sure) and registered with CUDA

A sample trial:

start = microtime();

for (int i = 0; i < n; ++i) {
    CUDA_CALL(cudaMemcpy(host_cuda, dev_cuda, buffer_size, cudaMemcpyDeviceToHost));

    glBindBuffer(GL_ARRAY_BUFFER, dev_vbo);
    void* data = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    memcpy(data, host_cuda, buffer_size);
    glUnmapBuffer(GL_ARRAY_BUFFER);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}

end = microtime();

printf("dev_cuda -> host_cuda -> dev_vbo (ARRAY_BUFFER / glMapBuffer) : %f GB/s\n",
       n*buffer_size/((end-start)*1024*1024*1000));

System info:

Kernel: 2.6.22-14-generic #1 SMP Tue Feb 12 02:46:46 UTC 2008 x86_64 GNU/Linux

CPU: Intel Core 2 Duo T7500 @ 2.20 GHz

GPU: GeForce 8600M GT 512 MB (CUDA 1.1) @ 0.95 GHz

Driver: 171.06 beta

Results:

host_cuda -> dev_cuda (cudaMemcpy) : 2.120048 GB/s

host_malloc -> dev_cuda (cudaMemcpy) : 1.077509 GB/s

host_cuda -> dev_vbo (cudaGLMapBufferObject + cudaMemcpy): 0.438739 GB/s

host_malloc -> dev_vbo (cudaGLMapBufferObject + cudaMemcpy): 0.365326 GB/s

dev_vbo (cudaGLMapBufferObject ONLY, no copying): 0.542127 GB/s

host_malloc -> dev_vbo (PIXEL_UNPACK_BUFFER / glBufferData) : 0.799993 GB/s

host_malloc -> dev_vbo (ARRAY_BUFFER / glBufferData) : 0.787505 GB/s

host_cuda -> dev_vbo (PIXEL_UNPACK_BUFFER / glBufferData) : 0.793959 GB/s

host_cuda -> dev_vbo (ARRAY_BUFFER / glBufferData) : 0.860424 GB/s

host_cuda -> dev_vbo (ARRAY_BUFFER / glMapBuffer + memcpy) : 1.739634 GB/s

host_malloc -> dev_vbo (ARRAY_BUFFER / glMapBuffer + memcpy) : 1.773318 GB/s

host_cuda -> dev_vbo (PIXEL_UNPACK_BUFFER / glMapBuffer + memcpy) : 1.818944 GB/s

host_malloc -> dev_vbo (PIXEL_UNPACK_BUFFER / glMapBuffer + memcpy) : 1.824821 GB/s

dev_cuda -> dev_vbo (cudaGLMapBufferObject + cudaMemcpy) : 0.463311 GB/s

dev_cuda -> host_cuda -> dev_vbo (ARRAY_BUFFER / glBufferData) : 0.542311 GB/s

dev_cuda -> host_cuda -> dev_vbo (PIXEL_UNPACK_BUFFER / glBufferData) : 0.560829 GB/s

dev_cuda -> host_cuda -> dev_vbo (ARRAY_BUFFER / glMapBuffer) : 0.848025 GB/s

Conclusions for this specific hardware setup (I’ll test a Quadro FX 5600 and an 8800 GTX later; from previous experience, I expect the Quadro to behave differently):

  • Always use CUDA pinned host memory (CUDA can DMA directly to the hardware from this memory, in some cases resulting in 2x speed-ups)

  • cudaGLMapBufferObject is horribly slow. I suspect it copies the device memory to the host and then re-uploads it into OpenGL after modification. It does not appear to be possible to copy directly from CUDA device memory to OpenGL device memory.

  • glMapBuffer and memcpy appear to be the fastest way to get data into an OpenGL VBO/PBO.

  • It is actually two times faster to copy CUDA device memory to CUDA host memory and then copy it back to OpenGL using glMapBuffer than to use cudaGLMapBufferObject and cudaMemcpyDeviceToDevice (see the sketch after this list).
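The “direct” path that last bullet refers to is roughly this (a sketch using the same buffer names as the benchmark above):

void* vbo_ptr;
CUDA_CALL(cudaGLMapBufferObject(&vbo_ptr, dev_vbo));
CUDA_CALL(cudaMemcpy(vbo_ptr, dev_cuda, buffer_size, cudaMemcpyDeviceToDevice));
CUDA_CALL(cudaGLUnmapBufferObject(dev_vbo));

On this machine it runs at roughly half the throughput of going through host_cuda and glMapBuffer (0.46 vs 0.85 GB/s in the results above).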

In my opinion this is very bad, as many applications compute something using CUDA and then map a GL buffer and copy the data there for display. Most people assume the data stays on the device, implying roughly ~50-60 GB/s transfer speeds, but this does not seem to be the case at all. This can actually become a bottleneck for applications displaying output at large resolutions.

I just ran into the same problem and I -cannot- believe it.

So far, for me, the conclusion is simple: forget about Cuda and go back to GLSL…

Could somebody from NVIDIA shed some light on what’s going on here?

I haven’t done extensive tests, but I have found some internal/external format combinations that are horribly slow. For me, the fastest transfer is RGBA with ubyte as both the internal and external format.

But I can’t seem to find any bottleneck in PBO transfers. Example with 1280x1024 PBO:

// Init
glBufferData( GL_PIXEL_UNPACK_BUFFER, ..., NULL, GL_STREAM_COPY )

// Begin loop
glBindTexture(...)
glBindBuffer( GL_PIXEL_UNPACK_BUFFER, ... )
cudaGLMapBufferObject(...)
// do nothing
cudaGLUnmapBufferObject(...)
glTexSubImage2D(...)
// draw quad with PBO texture
// End loop
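Fleshed out, one loop iteration looks roughly like this (placeholder names; the 1280x1024 RGBA/ubyte format matches the numbers above):

glBindTexture(GL_TEXTURE_2D, tex);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

void* dptr;
cudaGLMapBufferObject(&dptr, pbo);
// a kernel writing RGBA pixels into dptr would go here
cudaGLUnmapBufferObject(pbo);

// the texture update is sourced from the bound PIXEL_UNPACK_BUFFER (offset 0)
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 1280, 1024, GL_RGBA, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

// ...then draw a textured quad...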

I get more than 1600 fps, which puts the whole operation at about 0.625 ms. If I add an empty kernel call while the PBO is mapped, I get about 1000 fps, or about 1 ms. I don’t know if this performance is enough for your applications.

As for host → device copy, I only do that once so never bothered to time it correctly.

I encountered this problem when I updated CUDA from 1.1 to 2.0: cudaGLMapBufferObject suddenly took a lot of running time. Then I found out that for the same PBO, only the first call to cudaGLMapBufferObject was slow. If you map a PBO once at the initialization stage and then measure 100 map/unmap cycles on that same PBO, you will see that the running time is actually the same as in version 1.1, which was almost instant. Can anyone verify this? I don’t have access to a CUDA machine at the moment.
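In other words, something like this (a sketch; pbo stands for whatever buffer object you have registered with CUDA):

void* p;

// initialization: the first map of this PBO is the slow one
cudaGLMapBufferObject(&p, pbo);
cudaGLUnmapBufferObject(pbo);

// later: time 100 map/unmap cycles on the same PBO
for (int i = 0; i < 100; ++i) {
    cudaGLMapBufferObject(&p, pbo);
    // kernel work would go here
    cudaGLUnmapBufferObject(pbo);
}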

CUDA 2.0???

Anyway, with 1.1, the map/unmap operation is, in my case, slow every single time it is called (even without any modification to the GL buffer).

We are aware that the current implementation of OpenGL interop has some performance problems and are working on improving it.

You should see some improvement in the forthcoming CUDA 2.0 release.

For the record, my timings above were for an experiment repeated 100 times. You always have to time something in a loop like that to avoid erroneous measurements.

Thank you for the response, Simon. Looking forward to it.

That’s pretty much the perf I get as well. But I’m playing with high-res textures (around 5000x5000), so things get worse.

Also, I have to admit I originally thought the map/unmap operation would be essentially free, not requiring any copy. After all, the texture is already here in the video RAM!

Now I have to do the following copy: GL texture → GL buffer → cuda memory.

And considering the perf I got at high resolution, I would not be surprised if it was actually doing:

GL texture → GL buffer → host → cuda memory.
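For reference, that first chain is roughly (placeholder names; the RGBA/ubyte format is just an assumption):

glBindTexture(GL_TEXTURE_2D, tex);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);   // GL texture -> GL buffer
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

void* dev_ptr;
cudaGLMapBufferObject(&dev_ptr, pbo);                            // expose the buffer to CUDA
cudaMemcpy(d_data, dev_ptr, size, cudaMemcpyDeviceToDevice);     // GL buffer -> cuda memory
cudaGLUnmapBufferObject(pbo);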

Good to know, thanks!

Sorry, my mistake, I’m talking about the upgrade from CUDA 1.0 to CUDA 1.1. Anyway, I did see the slowdown of PBO mapping between CUDA 1.0 and CUDA 1.1. I also noticed in my code that copying a texture to a PBO was really slow (less than 2 GB/s), so it is the CUDA buffer mapping that is slow. Hope this will be fixed soon.


Hello Simon! You guys have any official date or estimate on the CUDA 2.0 release?

I’m sorry if this question is not completely performance related, but I hope someone could help me here.

I tried to use Pixel Buffer Objects to interact with OpenGL textures, following the sample code from the postProcessGL demo. Unfortunately my application crashes during the cudaGLMapBufferObject() call, without ANY CUDA or OpenGL error message.

After hours and hours of trying to debug my code, I think the problem is related to CEGUI (http://www.cegui.org.uk), which I am using.

Has anyone ever heard of this before, or is anyone using CEGUI without seeing this behaviour in their implementation?

Any advice would be more than welcome.

thanks, Chris

I just wanted to add a quick ‘me too’ on the issue of PBO speed, and now that I’ve replied I’ll get emails about any responses to this thread.

Thanks.