I found a few other posts about this using the forum search, but no solutions. I’m working on a proof-of-concept for using CUDA as a deinterlacer. This needs to get a frame out every ~16ms, but the program spends most of its time mapping and unmapping the OpenGL pixel buffer objects. For 640x480 video, the average map+unmap time is 8.6ms, and for 1920x1280 video, the time is 35.4ms!
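To put those numbers against the budget, here is a quick arithmetic sketch. The ~16ms target and the two measured averages are the figures above; the helper names budget_fraction and print_budget_report are made up for illustration. It works out to roughly 54% of the frame budget at 640x480 and a bit over twice the budget at 1920x1280, before the kernel or any drawing happens.

```c
#include <assert.h>
#include <stdio.h>

/* Fraction of the per-frame time budget consumed by map+unmap alone.
   Hypothetical helper; both arguments are in milliseconds. */
static double budget_fraction(double map_unmap_ms, double budget_ms)
{
    assert(budget_ms > 0.0);
    return map_unmap_ms / budget_ms;
}

/* Print both measured cases against the ~16 ms target from the post. */
static void print_budget_report(void)
{
    const double budget_ms = 16.0;  /* approximate target per output frame */
    const double small_ms  = 8.6;   /* measured average, 640x480 */
    const double large_ms  = 35.4;  /* measured average, 1920x1280 */

    printf("640x480:   %.0f%% of budget\n",
           100.0 * budget_fraction(small_ms, budget_ms));
    printf("1920x1280: %.0f%% of budget\n",
           100.0 * budget_fraction(large_ms, budget_ms));
}
```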
This is a naive bob deinterlacer that uses 9 PBOs: one each for the Y, Cb, and Cr planes of the source frame, the even-field destination, and the odd-field destination. Every buffer holds unsigned chars. The source frame is initialized like this:
GLuint buffers[3];
glGenBuffersARB(3, buffers);
for(i = 0; i < 3; i++) {
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, buffers[i]);
    glBufferDataARB(GL_PIXEL_UNPACK_BUFFER_ARB, widths[i]*heights[i], data[i], GL_STREAM_DRAW_ARB);
    glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);
}
for(i = 0; i < 3; i++) {
    error = cudaGLRegisterBufferObject(buffers[i]);
    ...error handling...
}
The destination frames are initialized the same way, except that data is an array of NULL pointers instead of unsigned char *s, so those buffers are allocated but not filled.
When the program is ready to use CUDA to split the fields, it works like this:
uint8_t *f_y, *f_cr, *f_cb;
...more declarations...
cudaGLMapBufferObject((void**)&f_y , frame->pbo_y);
cudaGLMapBufferObject((void**)&f_cr, frame->pbo_cr);
cudaGLMapBufferObject((void**)&f_cb, frame->pbo_cb);
...more mapping...
...call field splitter kernel (<1ms)...
cudaGLUnmapBufferObject(frame->pbo_y);
cudaGLUnmapBufferObject(frame->pbo_cr);
cudaGLUnmapBufferObject(frame->pbo_cb);
...more unmapping...
This code all works, but too slowly. With these timings, I suspect it would be faster to copy the results to the host and then upload them back to OpenGL. Am I doing something wrong, is this a bug, or something else?
CUDA 1.1 on Linux x86-64, driver version 169.09, 8800 GT hardware.
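For reference, the field split itself is trivial, which is why the kernel time is negligible next to the mapping. Below is a minimal CPU sketch of the same operation on one plane. It assumes tightly packed rows (no padding between lines) and half-height destination planes; split_fields is an illustrative name, not the actual kernel, and it only separates the fields without any line doubling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the even-numbered rows of one plane into `even` and the
   odd-numbered rows into `odd`.
   Assumptions: rows are tightly packed (stride == width), `even` holds
   at least width * ((height + 1) / 2) bytes, and `odd` holds at least
   width * (height / 2) bytes. */
static void split_fields(const uint8_t *src, size_t width, size_t height,
                         uint8_t *even, uint8_t *odd)
{
    assert(src != NULL && even != NULL && odd != NULL);

    for (size_t row = 0; row < height; row++) {
        /* Row 0, 2, 4, ... go to the even field; 1, 3, 5, ... to the odd. */
        uint8_t *dst = (row % 2 == 0) ? even : odd;

        /* Source row `row` becomes row `row / 2` of its field. */
        memcpy(dst + (row / 2) * width, src + row * width, width);
    }
}
```

The same function would be called once per plane (Y, Cb, Cr), with each plane's own width and height.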