OpenGL interop performance issues again... (or rather, still...)

Smokey · April 2, 2009, 11:47pm

Hey everyone,

I’m back to analyzing what part of our algorithm to optimize next - and the biggest performance hit right now appears to be OpenGL…

We have a couple of 320x240 images in OpenGL, 24bpp and 16bpp (so, 225kB & 150kB), which we need to have access to in our CUDA kernels once per frame.

At present, locking these images still takes 150-350us, and unlocking takes the same amount of time - for an accumulated time of 300-700us just to lock/unlock the buffers each time (twice per frame, this sometimes approaches 1.5ms, which is ludicrous - though it tends to average 600-700us per frame)…

From what I understand, since CUDA 2.1 - transfer of OpenGL memory to CUDA memory wasn’t supposed to go via system memory (unless transferring between GPUs), however the performance I’m seeing is less than half my PCI express bandwidth - and barely a few % of my device bandwidth - so I’m having trouble understanding what the hell is going on…

Each buffer is created only once (at start of application), and registered/mapped just before use, and unmapped/unlocked just after use.
Mapping/Unmapping generally takes 20-50us, registering generally takes 150-300us, and unregistering generally 100-250us… excluding spikes (which can sometimes reach 800us just to register, another 400 to unregister, etc).

I get more or less similar performance on 8600, 8800, and Quadro FX 570 - all using CUDA 2.1, all on Windows XP 32bit…

Is this ‘performance’ due to the fact OpenGL interop still runs over the PCI bus (despite what I recall reading?), or is it the fact my data is too small to transfer - so I’m actually seeing the latency of the memory copy - and thus not achieving full bandwidth?

I’m guessing it’s the former, because I’m assuming the latter would reach far, far more than ~1.4% of peak bandwidth (which is what I’m getting, ~600MB/s out of ~45GB/s)… right?
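For reference, the bandwidth figures quoted above can be roughly reproduced from the other numbers in this post. This is only a sketch in C under stated assumptions: both images (320x240 at 24bpp and at 16bpp) are counted once per frame, the whole ~650us of per-frame overhead is treated as transfer time, and the peak is taken as 45GB/s. The function names are made up for illustration.

```c
#include <assert.h>

/* Effective throughput in MB/s (1 MB = 1e6 bytes).
   Bytes per microsecond is numerically the same as MB per second. */
double effective_mb_per_s(double bytes, double microseconds)
{
    assert(microseconds > 0.0);
    return bytes / microseconds;
}

/* Share of a peak rate given in GB/s (1 GB = 1e9 bytes), as a percentage. */
double percent_of_peak(double mb_per_s, double peak_gb_per_s)
{
    assert(peak_gb_per_s > 0.0);
    return 100.0 * mb_per_s / (peak_gb_per_s * 1000.0);
}
```

With 320*240*3 + 320*240*2 = 384000 bytes and 650us, effective_mb_per_s gives roughly 590MB/s, and percent_of_peak(…, 45.0) gives roughly 1.3%, which is in line with the ~600MB/s and ~1.4% mentioned above.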

Or even worse, is this simply the expected performance for opengl interop? :blink:

spacerat · April 3, 2009, 3:25am

Did you try the CUDA 2.2 beta? It’s supposed to have better interop timings.
http://forums.nvidia.com/lofiversion/index.php?t92416.html

Smokey · April 3, 2009, 3:59am

I have not, no - mainly because I’m not a registered developer, and 3 applications is enough for me to have given up trying.

Simon_Green · April 3, 2009, 9:20am

You only need to register the buffer object once, not every time (this was an error in the documentation).

We are working on direct OpenGL texture interoperability, which should make stuff like this easier and higher performance.

jkuo · April 8, 2009, 7:55pm

I’ve been playing around with the postprocessGL example, and have found similar map/unmap timings. We are developing an application that requires some postprocessing of data for visualization, but on multiple monitors (and thus using two GPU cards). Once I have all the monitors enabled, I find that the mapping/unmapping times shoot up from the original 20-50us to a poky 5-10ms. I can only keep the map/unmap times in the sub-ms range if only one gpu is driving monitors. Is there any way around this? Driving only one video card is not a viable solution for us.

Hippo · April 10, 2009, 11:26pm

I’ve been having similar problems with cudaGLMapBufferObject and cudaGLUnmapBufferObject. For example, in a test application I developed, I allocate a VBO using the following code:

GLuint testVbo;
unsigned int testVboSize = 512*512*sizeof(float)*3;

// Create the VBO and allocate its storage (no initial data).
glGenBuffersARB (1, &testVbo);
glBindBufferARB (GL_ARRAY_BUFFER_ARB, testVbo);
glBufferDataARB (GL_ARRAY_BUFFER_ARB, testVboSize, 0, GL_DYNAMIC_COPY_ARB);
glBindBufferARB (GL_ARRAY_BUFFER_ARB, 0);

// Register the VBO with CUDA once, right after creation.
cudaGLRegisterBufferObject(testVbo);

And then each frame I do the following:

void *test_ptr;
cudaGLMapBufferObject (&test_ptr, testVbo);
cudaGLUnmapBufferObject (testVbo);

I’ve timed the second block of code for the aforementioned VBO size, and the time I get is ~6.82ms. Is this expected? I read elsewhere on the forums that the map/unmap functions sometimes copy memory, but found no indication of what constituted this. Additionally, with a VBO size of 1 byte, it still takes ~0.8ms, which I don’t fully understand - what do these functions actually do?
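For anyone wanting to compare numbers, here is a rough sketch in C of how a map/unmap pair could be timed over many frames, reporting both the mean and the worst case (spikes come up several times in this thread). It is an assumption about method, not the code actually used for the timings above: time_calls, timing_stats and count_call are hypothetical names, the timer is POSIX clock_gettime (so this will not build as-is on Windows), and count_call is only a placeholder for a callback that would call cudaGLMapBufferObject followed by cudaGLUnmapBufferObject.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

/* Callback type for the code under test (e.g. one map/unmap pair). */
typedef void (*timed_fn)(void *user_data);

typedef struct {
    double mean_us; /* average time per call, in microseconds */
    double max_us;  /* slowest single call, in microseconds */
} timing_stats;

static double elapsed_us(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) * 1e6
         + (double)(end->tv_nsec - start->tv_nsec) / 1e3;
}

/* Calls fn(user_data) `iterations` times and times each call separately. */
timing_stats time_calls(timed_fn fn, void *user_data, int iterations)
{
    timing_stats stats = {0.0, 0.0};
    double total_us = 0.0;
    int i;

    assert(fn != NULL && iterations > 0);
    for (i = 0; i < iterations; ++i) {
        struct timespec start, end;
        double us;

        clock_gettime(CLOCK_MONOTONIC, &start);
        fn(user_data);
        clock_gettime(CLOCK_MONOTONIC, &end);

        us = elapsed_us(&start, &end);
        total_us += us;
        if (us > stats.max_us) {
            stats.max_us = us;
        }
    }
    stats.mean_us = total_us / iterations;
    return stats;
}

/* Placeholder workload: just counts how often it was called.
   A real test would map and unmap the buffer object here instead. */
void count_call(void *user_data)
{
    ++*(int *)user_data;
}
```

Usage would look like `timing_stats s = time_calls(count_call, &calls, 1000);`. Timing each call separately, rather than the whole loop, is what makes the spikes visible next to the average.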

Could anyone from NVIDIA clarify on why this is happening, or some solution to this problem (if there is one)? I have version 2.1 of the toolkit and SDK and am on Mac OS X 10.5.6 with a GeForce 8600M GT. Is this issue remedied in 2.2?

Smokey · April 16, 2009, 12:14am

I should probably note that it still takes the same amount of time (maybe 50us-100us less overall), even if I only register once.

(not registering each time before mapping simply bumped the map time up to 120us or so, from 50)

Edit: Terminology fix - switched ‘lock’ with ‘map’.

Simon_Green · April 16, 2009, 9:27am

One issue to be aware of is that calling the Map/Unmap functions causes a context switch between the OpenGL and CUDA drivers.

Therefore you should try and group your map calls at the beginning of the frame and the unmap calls at the end of the frame, rather than interleaving them with the CUDA code.
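A rough sketch in C of the call ordering described above: all maps first, then the CUDA work, then all unmaps. To keep it runnable without a GPU, the map, unmap and kernel steps are hypothetical function pointers, and the stub_* functions only record the call order. In a real application they would wrap cudaGLMapBufferObject, cudaGLUnmapBufferObject and the kernel launches. None of these names come from the CUDA API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hooks standing in for the real interop calls. */
typedef struct {
    void (*map)(unsigned int buffer);   /* would wrap cudaGLMapBufferObject */
    void (*unmap)(unsigned int buffer); /* would wrap cudaGLUnmapBufferObject */
    void (*run_cuda_work)(void);        /* would launch the kernels */
} frame_ops;

/* One frame with the map calls grouped at the start and the unmap calls
   grouped at the end, instead of interleaved with the CUDA work. */
void run_frame_grouped(const frame_ops *ops,
                       const unsigned int *buffers, size_t count)
{
    size_t i;

    assert(ops != NULL && buffers != NULL);
    for (i = 0; i < count; ++i) {
        ops->map(buffers[i]);
    }
    ops->run_cuda_work();
    for (i = 0; i < count; ++i) {
        ops->unmap(buffers[i]);
    }
}

/* Recording stubs: 'M' = map, 'K' = CUDA work, 'U' = unmap. */
char call_trace[64];
size_t call_trace_len = 0;

void record_call(char tag)
{
    if (call_trace_len + 1 < sizeof call_trace) {
        call_trace[call_trace_len++] = tag;
        call_trace[call_trace_len] = '\0';
    }
}

void stub_map(unsigned int buffer)   { (void)buffer; record_call('M'); }
void stub_unmap(unsigned int buffer) { (void)buffer; record_call('U'); }
void stub_cuda_work(void)            { record_call('K'); }
```

Running one frame over two buffers with the stubs leaves "MMKUU" in call_trace, whereas an interleaved version would produce something like "MKUMKU", with more transitions between the OpenGL and CUDA sides.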
