CUDA/OpenGL interop 'bug'/missing-documentation

Vista (32bit)
Driver Version: 181.20
CUDA Version: 2.0 / 2.1

Either cuGLUnmapBufferObject or cuGLUnregisterBufferObject leaves the PBO bound to OpenGL's GL_ARRAY_BUFFER binding point.

I’m not sure if this is ‘intended’ (which would be pretty stupid, considering it would completely break VBO/PBO-unaware applications that still use glVertexPointer and the like), or if you guys simply forgot to document it (again -_-). Either way, certain driver versions do this, OpenGL developers ‘need’ to know - and/or you should be cleaning up after yourselves.

I wasted 12 working hours thinking my OpenGL code was doing something ‘strange’ that somehow broke my rendering on Vista only - only to finally realise that, despite the fact I wasn’t using VBOs, CUDA was leaving buffers bound to that binding point… thus breaking my glVertexPointer/glDrawElements code in the cases where I wasn’t using VBOs for one reason or another.

Cheers,

P.S. Related opengl.org forum link to my problem, for those who are interested

I have good news and bad news. First, the bad: indeed this is a bug. It actually happens in cuGLMapBufferObject. I am very sorry you had to waste your time with this.

The good news is, this happens only in the fallback path. For some reason you are missing the fast path, and if we can fix that, you won’t be affected by the bug. As a bonus, you will see better performance during the map/unmap operations!

I see you are running on Vista. I’m guessing you are calling cuCtxCreate (or cudaSetDevice) instead of cuGLCtxCreate (or cudaGLSetGLDevice). Due to restrictions in the Vista operating system, you must use the GL versions of these functions for high performance data sharing between GL and CUDA.
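For the driver API, that looks something like the sketch below (my own simplified example, not official sample code - error checking omitted). The key assumption is that a GL context is already current on the calling thread, e.g. via wglMakeCurrent, before the CUDA context is created:

```c
/* Sketch: driver-API initialisation for GL interop.
 * A GL context must already be current on this thread. */
#include <cuda.h>
#include <cudaGL.h>

CUcontext create_interop_context(void)
{
    CUcontext ctx;
    CUdevice dev;

    cuInit(0);
    cuDeviceGet(&dev, 0);           /* first CUDA device */
    cuGLCtxCreate(&ctx, 0, dev);    /* NOT cuCtxCreate - this is what
                                       enables the fast interop path */
    return ctx;
}
```

With the runtime API, the equivalent is calling cudaGLSetGLDevice (instead of cudaSetDevice) before any other CUDA call.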

This bug will be fixed in a future driver release. Here’s a portable workaround, which should be unnecessary in the general case:

[indent]GLint currentBuffer;

glGetIntegerv(GL_ARRAY_BUFFER_BINDING, &currentBuffer); // save the current binding

cuGLMapBufferObject(…);

glBindBuffer(GL_ARRAY_BUFFER, (GLuint)currentBuffer); // restore it afterwards[/indent]

On that note, none of the cuGL* functions’ documentation states that an OpenGL context must be current - and while that’s obvious for commands such as Register/Map/Unmap/Unregister, it’s not as obvious for cuGLCtxCreate/cuGLInit (which also appear to have this requirement).

Question,

What’s the difference between cuGLCtxCreate, and cuCtxCreate followed by cuGLInit?

(We’re currently using the latter method, creating the context normally - and if/when OpenGL interop is required, we use cuGLInit)

It’s not overly practical for us to use cuGLCtxCreate: our OpenGL contexts aren’t created until quite some time after the CUDA initialisation, and we can’t force all our other apps to create an OpenGL context just to use CUDA…

Unless there’s a more flexible/practical way to allocate contexts that I’m not aware of - such that we could easily have one context per ‘kernel instance’?

We’re currently creating one context per device, globally - because we share memory between kernels quite a lot, and I’m not aware of any way to make an address allocated in context A valid in context B (contexts have separate address spaces) without transferring the memory via the host (which is, of course, very slow).
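To illustrate, here’s a simplified sketch of the pattern I mean (my own simplification, driver API; I use cuCtxPushCurrent/cuCtxPopCurrent to hand the single context between host threads so all allocations share one address space):

```c
/* Sketch: one CUDA context per device, shared by every "kernel
 * instance". Pointers from cuMemAlloc are only valid inside the
 * context that allocated them, so each worker attaches to the same
 * context instead of creating its own. Error checking omitted. */
#include <stddef.h>
#include <cuda.h>

static CUcontext g_ctx;   /* one global context per device */

void init_device(void)
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&g_ctx, 0, dev); /* bound to this thread... */
    cuCtxPopCurrent(NULL);       /* ...detach so other threads can use it */
}

void worker(void)
{
    CUdeviceptr p;
    cuCtxPushCurrent(g_ctx);     /* attach the shared context */
    cuMemAlloc(&p, 1 << 20);     /* visible to any thread that pushes g_ctx */
    /* ... launch kernels here ... */
    cuMemFree(p);
    cuCtxPopCurrent(NULL);       /* detach again */
}
```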

Any advice would be greatly appreciated,

Cheers.

Edit: For what it’s worth, I managed to implement a way for applications to initialise CUDA with contexts created by cuGLCtxCreate, and it did bypass the bug as you said. However, on Vista I measured no noticeable difference in speed between your ‘fast’ path and ‘slow’ path - or if there is a difference, it’s a couple of hundred us at the ‘very’ most across an entire frame of my application (and I map/unmap two buffers per frame, among other things that may have entered the fast path, so I’m guessing no more than a 50us speed-up per function, which is nice, I guess). Oddly, I did see my render times increase from 3-4ms (yay Vista; XP barely takes 800us) to about 5-6ms - though I’m not sure that’s related, as other changes have been made.

The Vista driver model imposes limitations on our ability to share objects even within the driver. cuGLCtxCreate works around this by creating the CUDA context under the same umbrella as the GL driver - something we cannot do generically since not all CUDA apps use GL. There is a similar requirement for DX interop. The latter method you describe works for WinXP / Linux but not for Vista. When we hit the fallback path, as will happen if you don’t use the GL version of the command, mapping and unmapping a GL resource currently forces a memcpy through host memory. We’re looking at ways to improve this.

As I’m primarily focused on the CUDA/GL interop, I’m hopeful someone else can offer some guidance on the rest of your questions.

Interop performance generally affects the time to map and unmap the resource only. If your frame time is dominated by other operations, making interop faster won’t show much improvement. One thing to watch for is the frequency of context switches between GL and CUDA. Conceptually, cuGLMapBufferObject starts in GL and ends in CUDA, and vice versa for cuGLUnmapBufferObject. So ensuring that all GL operations are outside the map/unmap and all CUDA operations are between map/unmap will generally give the best result. Depending on the application and the chip, the difference can be surprising. YMMV.
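The ordering advice above can be sketched as a hypothetical frame loop (my own illustration, not sample code; it assumes an extension loader such as GLEW for the GL 1.5 buffer entry points, and that `vbo` was registered earlier with cuGLRegisterBufferObject):

```c
/* Sketch: all GL work outside the map/unmap pair, all CUDA work
 * inside it, so each frame pays for only one GL->CUDA and one
 * CUDA->GL transition. Error checking omitted. */
#include <GL/glew.h>
#include <cuda.h>
#include <cudaGL.h>

void render_frame(GLuint vbo)
{
    CUdeviceptr dptr;
    unsigned int size;

    /* GL side: finish any pending GL state changes first */
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    /* one transition into CUDA */
    cuGLMapBufferObject(&dptr, &size, vbo);

    /* ... all CUDA work goes here, between map and unmap ... */

    /* one transition back into GL */
    cuGLUnmapBufferObject(vbo);

    /* GL side: draw with the updated buffer */
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glVertexPointer(3, GL_FLOAT, 0, (void *)0);
    glDrawArrays(GL_TRIANGLES, 0, 3);
}
```

Interleaving GL calls between the map and unmap would force extra transitions per frame, which is exactly what this structure avoids.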