Currently, I am rendering an object with OpenGL and post-processing the resulting image.
My images are 640x480, 8-bit greyscale.
Since glReadPixels was very slow when transferring data from the GPU to host memory (2.4ms), I decided to post-process the OpenGL-rendered image directly with CUDA.
I am surprised that performance is even worse on the GPU: glReadPixels takes 2.4ms and cudaGLMapBufferObject another 1ms.
Although my kernel code performs very well :), glReadPixels and cudaGLMapBufferObject seem to be the bottleneck.
Perhaps there is a mistake in my code?
Thank you!
renderObjectWithOpenGL(...); //0.05ms-3ms (depends on object)
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, bufferIDx); //negligible amount of time
glBufferData(GL_PIXEL_PACK_BUFFER_ARB, 640*480, NULL, GL_STREAM_READ); //negligible amount of time
cudaGLRegisterBufferObject(bufferIDx); //0.4ms
glReadPixels(0, 0, 640, 480, GL_LUMINANCE, GL_UNSIGNED_BYTE, 0); //2.4ms
cudaGLMapBufferObject( (void**)&in_data, bufferIDx); //1ms
launch_kernel(in_data, d_result); //post-processing kernel on the mapped PBO
cudaThreadSynchronize(); //wait for the kernel to finish before unmapping
cudaGLUnmapBufferObject(bufferIDx); //0.4ms
cudaGLUnregisterBufferObject(bufferIDx); //0.3ms
cudaMemcpy(resultcuda, d_result, blub*sizeof(float), cudaMemcpyDeviceToHost); //copy results back to host
glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0); //negligible amount of time
First of all, did you use cudaGLSetGLDevice to set the device? If I recall correctly, it should be called after OpenGL is initialized, to specify that the CUDA context will use OpenGL interoperability.
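To illustrate the call order, here is a minimal sketch; it assumes GLUT creates the GL context and that device 0 drives the display, so adapt both to your own setup:

```cpp
// Sketch of the required initialization order (assumed: GLUT windowing,
// CUDA device 0). cudaGLSetGLDevice must come after context creation.
#include <GL/glut.h>
#include <cuda_gl_interop.h>

int main(int argc, char **argv)
{
    glutInit(&argc, argv);        // 1. create the OpenGL context first
    glutCreateWindow("render");
    cudaGLSetGLDevice(0);         // 2. then attach CUDA with GL interop enabled
    // ... create and register the PBO, then enter the render loop ...
    return 0;
}
```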
Second, try using the GL_DYNAMIC_COPY_ARB usage hint instead of GL_STREAM_READ. These flags matter because they hint to the driver where the buffer should be allocated. If you are not careful, glReadPixels will copy the data via the host instead of leaving it on the device, and/or the driver will cache the data in some cases. The correct usage hint can make a very big difference here.
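Concretely, that suggestion is a one-line change to the glBufferData call in the snippet above:

```cpp
// GL_DYNAMIC_COPY_ARB hints that the buffer is written by GL (glReadPixels)
// and read by another API (CUDA), so the driver can keep it in device memory
// instead of staging it through the host.
glBufferData(GL_PIXEL_PACK_BUFFER_ARB, 640 * 480, NULL, GL_DYNAMIC_COPY_ARB);
```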