OpenGL Interoperability: Copying from OpenGL to CUDA

I have found that parts of my code perform much better with classic GPGPU techniques in OpenGL (mainly because I make heavy use of the blend stage).

So I use the regular OpenGL render-to-texture mechanism (FBOs) to perform the OpenGL part of the computation. Afterwards I use buffer objects to copy these OpenGL textures into CUDA's global memory (i.e. glBindBuffer, glReadPixels). However, this hurts performance a lot, and I suspect the glReadPixels call is the culprit.
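For reference, here is a minimal sketch of that texture-to-CUDA path, assuming the CUDA 1.x GL interop API (cudaGLRegisterBufferObject / cudaGLMapBufferObject); pbo, width, and height are placeholder names:

```cuda
#include <GL/gl.h>
#include <cuda_gl_interop.h>

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4 * sizeof(float),
             NULL, GL_STREAM_COPY);
cudaGLRegisterBufferObject(pbo);   // one-time registration with CUDA

// Per frame: read back the FBO attachment into the bound PBO.
// With a PBO bound to GL_PIXEL_PACK_BUFFER, the last argument of
// glReadPixels is a byte offset into the buffer, not a CPU pointer.
glReadPixels(0, 0, width, height, GL_RGBA, GL_FLOAT, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// Map the PBO into CUDA's address space and run kernels on it.
float4 *d_ptr;
cudaGLMapBufferObject((void **)&d_ptr, pbo);
// ... launch kernels on d_ptr ...
cudaGLUnmapBufferObject(pbo);
```

Note that glReadPixels here is still a GPU-side copy from the texture into the buffer object; only the glBindBuffer/map step is "free".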

So I was wondering: is there a way to avoid the glReadPixels step and use OpenGL textures within a CUDA context directly?

Not currently. You can’t read directly from OpenGL textures in CUDA.

See the FAQ, question 8.

Thanks for the quick response.

Do you have any idea if it will be possible soon? (“Not currently” makes me think positively)

Or is there any possibility to leverage the blending stage directly from CUDA?

My problem is that I have a very large array that I read from and another that I write to. The reads and writes are completely random, though, and several writes may target the same memory location, so race conditions can occur unless you build a proper schedule ahead of time (on the CPU). The blend stage, however, handles these conditions for free (well, if it weren't for the memory copy further down my pipeline…)

Not anytime soon. It’s not in CUDA 1.1, certainly.

Note that in CUDA you could implement the equivalent of blending by doing a read/modify/write to global memory (as long as each thread is only writing to its own location).
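To illustrate the read/modify/write suggestion, here is a hypothetical sketch of additive blending (the GL_ONE, GL_ONE case) done in a CUDA kernel; it is race-free only because thread i touches element i and nothing else:

```cuda
// Sketch: emulate additive blending with a per-element read/modify/write.
// dst and src are device arrays of n pixels; each thread owns one element.
__global__ void blendAdd(float4 *dst, const float4 *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 d = dst[i];              // read
        float4 s = src[i];
        d.x += s.x;  d.y += s.y;        // modify: dst = dst + src
        d.z += s.z;  d.w += s.w;
        dst[i] = d;                     // write
    }
}
```

This only works for gather-style access; it does not help with the scattered, possibly colliding writes described above, which is exactly where hardware blending wins.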

The thing is… blending is atomic. However, CUDA on the 8800 doesn't support atomic operations yet. For things like multiplying a dynamic sparse matrix by a static dense vector, it's really handy.
Also, it's sometimes easier to do list operations in geometry shaders, especially for one-to-many maps.
I'm longing for the day GL and CUDA can be used interleaved without a memcpy…

I have some list operations too, which I will probably implement using the GS as well, to see how that performs…

In our experience list operations are best implemented in data-parallel fashion using CUDA (using scan etc.), rather than using the GS.
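As an example of the data-parallel approach, stream compaction (keeping only the valid list elements) can be done with a flag/scan/scatter pattern. This is a hypothetical sketch; it assumes the flags array has already been run through an exclusive scan (e.g. with the "scan" sample from the CUDA SDK) to produce scanned:

```cuda
// Sketch: scatter step of scan-based stream compaction.
// flags[i]   = 1 if element i should be kept, 0 otherwise
// scanned[i] = exclusive prefix sum of flags = output index of element i
__global__ void scatterValid(const int *in, const int *flags,
                             const int *scanned, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i])
        out[scanned[i]] = in[i];   // compacted, order-preserving write
}
```

One-to-many expansion works the same way, with the scan run over per-element output counts instead of 0/1 flags.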

That’s good to hear. Can I cite this later in my paper?