OpenGL interop performance issues again... (or rather, still...)

Hey everyone,

I’m back to analyzing what part of our algorithm to optimize next - and the biggest performance hit right now appears to be OpenGL…

We have a couple of 320x240 images in OpenGL, 24bpp and 16bpp (so, 225kB & 150kB), which we need to have access to in our CUDA kernels once per frame.

At present, locking these images still takes 150-350us, and unlocking takes the same amount of time, for an accumulated 300-700us just to lock/unlock the buffers each time (twice per frame, this sometimes approaches 1.5ms, which is ludicrous, though it tends to average 600-700us per frame)…

From what I understand, since CUDA 2.1 the transfer of OpenGL memory to CUDA memory isn’t supposed to go via system memory (unless transferring between GPUs). However, the performance I’m seeing is less than half my PCI Express bandwidth, and barely a few % of my device bandwidth, so I’m having trouble understanding what the hell is going on…

Each buffer is created only once (at start of application), and registered/mapped just before use, and unmapped/unlocked just after use.
Mapping/Unmapping generally takes 20-50us, registering generally takes 150-300us, and unregistering generally 100-250us… excluding spikes (which can sometimes reach 800us just to register, another 400 to unregister, etc).

I get more or less similar performance on 8600, 8800, and Quadro FX 570 - all using CUDA 2.1, all on Windows XP 32bit…

Is this ‘performance’ due to the fact OpenGL interop still runs over the PCI bus (despite what I recall reading?), or is it the fact my data is too small to transfer - so I’m actually seeing the latency of the memory copy - and thus not achieving full bandwidth?

I’m guessing it’s the former, because I’m assuming the latter would reach far, far more than ~1.4% of peak bandwidth (which is what I’m getting, ~600MB/s out of ~45GB/s)… right?
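For anyone who wants to check the arithmetic behind those figures, here is a minimal sketch in plain C. The helper names (image_bytes, effective_bandwidth, percent_of_peak) are made up for illustration and are not part of CUDA or OpenGL. It treats the whole lock/unlock time as if it were transfer time, which is an assumption and may well not be what the driver is actually doing.

```c
#include <stddef.h>

/* Bytes in one uncompressed image of the given size and bit depth. */
size_t image_bytes(size_t width, size_t height, size_t bits_per_pixel)
{
    return width * height * bits_per_pixel / 8;
}

/* Effective throughput in bytes per second for `bytes` handled in
   `microseconds`. Assumes all of the time was spent moving data. */
double effective_bandwidth(size_t bytes, double microseconds)
{
    return (double)bytes / (microseconds * 1e-6);
}

/* Achieved throughput as a percentage of a peak figure (both in bytes/s). */
double percent_of_peak(double achieved, double peak)
{
    return 100.0 * achieved / peak;
}
```

With the two 320x240 images (24bpp and 16bpp) the total is 384000 bytes; at 600us per frame that works out to 640 MB/s, or roughly 1.4% of a 45 GB/s peak, which matches the numbers quoted above.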

Or even worse, is this simply the expected performance for OpenGL interop? :blink:

Did you try the CUDA 2.2 beta? It’s supposed to have better interop timings.

I have not, no - mainly because I’m not a registered developer, and 3 applications is enough for me to have given up trying.

You only need to register the buffer object once, not every time (this was an error in the documentation).

We are working on direct OpenGL texture interoperability, which should make stuff like this easier and higher performance.

I’ve been playing around with the postprocessGL example, and have found similar map/unmap timings. We are developing an application that requires some postprocessing of data for visualization, but on multiple monitors (and thus using two GPU cards). Once I have all the monitors enabled, I find that the mapping/unmapping times shoot up from the original 20-50us to a poky 5-10ms. I can only keep the map/unmap times in the sub-ms range if only one gpu is driving monitors. Is there any way around this? Driving only one video card is not a viable solution for us.

I’ve been having similar problems with cudaGLMapBufferObject and cudaGLUnmapBufferObject. For example, in a test application I developed, I allocate a VBO using the following code:

GLuint testVbo;
unsigned int testVboSize = 512*512*sizeof(float)*3;
glGenBuffersARB (1, &testVbo);
glBindBufferARB (GL_ARRAY_BUFFER_ARB, testVbo);

And then each frame I do the following:

void *test_ptr;
cudaGLMapBufferObject (&test_ptr, testVbo);
cudaGLUnmapBufferObject (testVbo);

I’ve timed the second block of code for the aforementioned VBO size, and the time I get is ~6.82ms. Is this expected? I read elsewhere on the forums that the map/unmap functions sometimes copy memory, but found no indication of what constituted this. Additionally, with a VBO size of 1 byte, it still takes ~0.8ms, which I don’t fully understand - what do these functions actually do?

Could anyone from NVIDIA clarify why this is happening, or suggest a solution to this problem (if there is one)? I have version 2.1 of the toolkit and SDK and am on Mac OS X 10.5.6 with a GeForce 8600M GT. Is this issue remedied in 2.2?

I should probably note that it still takes the same amount of time (maybe 50us-100us less overall), even if I only register once.

(not registering each time before mapping simply bumped the map time up to 120us or so, from 50)

Edit: Terminology fix - switched ‘lock’ with ‘map’.

One issue to be aware of is that calling the Map/Unmap functions causes a context switch between the OpenGL and CUDA drivers.

Therefore you should try and group your map calls at the beginning of the frame and the unmap calls at the end of the frame, rather than interleaving them with the CUDA code.
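A rough sketch of how that advice, combined with registering each buffer only once, might look in a frame loop. The four cudaGL* calls are the CUDA 2.x buffer-object interop functions discussed in this thread. Everything else (init_interop, process_frame, shutdown_interop, run_kernels, the USE_REAL_CUDA_GL switch and the call log) is hypothetical scaffolding. Without USE_REAL_CUDA_GL defined, the cudaGL* functions are replaced by stand-in stubs that only record call order, so the sketch compiles without a GPU; it says nothing about real timings.

```c
#include <stddef.h>

/* Records the order of calls so the pattern can be inspected. */
char call_log[64];
size_t call_count = 0;
void log_call(char c)
{
    if (call_count < sizeof call_log - 1) call_log[call_count++] = c;
}

#ifdef USE_REAL_CUDA_GL
#include <GL/gl.h>            /* platform GL header, assumed */
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#else
/* Stand-in stubs (hypothetical, NOT the real driver). */
typedef unsigned int GLuint;
typedef int cudaError_t;
cudaError_t cudaGLRegisterBufferObject(GLuint b) { (void)b; log_call('R'); return 0; }
cudaError_t cudaGLMapBufferObject(void **p, GLuint b) { (void)b; *p = NULL; log_call('M'); return 0; }
cudaError_t cudaGLUnmapBufferObject(GLuint b) { (void)b; log_call('U'); return 0; }
cudaError_t cudaGLUnregisterBufferObject(GLuint b) { (void)b; log_call('X'); return 0; }
#endif

/* Hypothetical placeholder for all CUDA kernel launches of one frame. */
void run_kernels(void *color, void *depth)
{
    (void)color; (void)depth;
    log_call('K');
}

/* Register each buffer once, at application start.
   Return codes are ignored here for brevity. */
void init_interop(GLuint color_buf, GLuint depth_buf)
{
    cudaGLRegisterBufferObject(color_buf);
    cudaGLRegisterBufferObject(depth_buf);
}

/* Per frame: all maps first, then the CUDA work, then all unmaps,
   rather than map/kernel/unmap interleaved per buffer. */
void process_frame(GLuint color_buf, GLuint depth_buf)
{
    void *color = NULL;
    void *depth = NULL;
    cudaGLMapBufferObject(&color, color_buf);
    cudaGLMapBufferObject(&depth, depth_buf);
    run_kernels(color, depth);
    cudaGLUnmapBufferObject(color_buf);
    cudaGLUnmapBufferObject(depth_buf);
}

/* Unregister once, at application shutdown. */
void shutdown_interop(GLuint color_buf, GLuint depth_buf)
{
    cudaGLUnregisterBufferObject(color_buf);
    cudaGLUnregisterBufferObject(depth_buf);
}
```

With the stubs, calling init_interop once, process_frame for each frame, and shutdown_interop once produces a call order of two registers, then map, map, kernels, unmap, unmap per frame, then two unregisters. Whether this actually reduces the per-frame cost on a given driver would need to be measured.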