CUDA and OpenGL interop: ideas and questions for a deeper understanding

I am working on an application that mixes OpenGL and CUDA to do processing of the data in a game renderer.

I would like to get an idea of the internal mechanisms behind the interop, so that I can assess the performance of different approaches without having to benchmark an exponential set of combinations or just imitate the CUDA samples.
I’m simplifying things a bit here, and I assume that engine/driver syncs are not an issue (i.e. Bindless Graphics & Co.).

  1. OpenGL seems to have less interoperability than DirectX, but this is just an illusion, since DirectX simply copies things to internal buffers while with OpenGL you’re in control of everything. Is this correct?

  2. Do the drivers have a single shared command buffer that both CUDA and GL append to? So while CUDA calls mixed with GL calls may be expensive (as if they performed a bunch of shader and render-target changes, …), they do not synchronize with the CPU?

  3. I have heard about driver “context switches” when switching APIs. Do all calls cause these, or just a few? (Maybe something that only sets deferred state, such as glClearColor, doesn’t cause one.) Do only kernel launches and SwapBuffers trigger them, or only when command buffers are full? How does this work?

3bis) Is cudaGLMapBufferObject/cudaGLUnmapBufferObject a CUDA or an OpenGL context call? I.e., do I mix it with my OpenGL calls and then later execute a bunch of CUDA calls on the mapped buffers, or do I mix it with the CUDA calls?

  4. cudaGLMapBufferObject returns a pointer. Is this just a simple lookup in the driver’s internal “globjects[id]->device_ptr” table, or does it actually force a GPU-to-CPU flush and sync before returning the address?
    Since it just returns a pointer to the data and not the actual data, it could in theory return the pointer before the GL calls writing to the buffer have completed. You could then queue kernel calls which will eventually use the data… and which we eventually (maybe next frame) read back, nice and fast ^_^

  5. If cudaGLMapBufferObject does wait for the data to be available, is there any benefit to PBO-style “double buffering”, i.e. rendering to another buffer than the one I read from and swapping? Or does a context switch force all GL calls to finish anyway?

  6. Ok, pushing my luck here: is there any chance I could do a few GetBufferParameterui64vNV(BUFFER_GPU_ADDRESS) + MakeBufferResidentNV calls once and pass the addresses to CUDA, or is this basically what cudaGLMapBufferObject does anyway? (It seems not, since cudaGLMapBufferObject allows write access.)
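For reference, newer CUDA releases supersede the cudaGLMapBufferObject family with the graphics-resource API, which at least makes the map/unmap ownership handoff explicit. A minimal sketch of the setup and per-frame pattern (error checking omitted; processKernel, grid and block are hypothetical placeholders, and this obviously needs a live GL context plus the CUDA toolkit to build):

```cpp
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// One-time setup: register an existing GL buffer object with CUDA.
cudaGraphicsResource *registerGLBuffer(GLuint vbo) {
    cudaGraphicsResource *res = nullptr;
    cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsRegisterFlagsNone);
    return res;
}

// Per frame: issue all GL writes to the buffer first, then hand it to CUDA.
void runCudaPass(cudaGraphicsResource *res, cudaStream_t stream) {
    // Mapping transfers ownership to CUDA; the driver inserts whatever
    // synchronization the GL-to-CUDA handoff requires.
    cudaGraphicsMapResources(1, &res, stream);

    void  *devPtr = nullptr;
    size_t bytes  = 0;
    cudaGraphicsResourceGetMappedPointer(&devPtr, &bytes, res);

    // processKernel<<<grid, block, 0, stream>>>(devPtr, bytes);  // hypothetical

    // Unmapping returns ownership to GL before the next draw that reads it.
    cudaGraphicsUnmapResources(1, &res, stream);
}
```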

Maybe this could be something to add to the best practices guide or other docs. It seems to be a bit of a grey area…

While we are on the topic of OpenGL and CUDA interoperability, are there matrix operators in CUDA for the standard matrix sizes used in graphics (e.g. mat4*mat4, mat3*mat3, etc.)? Note that I don’t much care for a heavyweight parallel library such as CUBLAS for this; I’m wondering if there is a native CUDA operation for the standard graphics matrix/vector sizes.

Thanks!
-Anand

Looks like the native float3 and float4 types should work for vectors. What about matrices (like float3x3 and float4x4)?

You should just implement your own classes. The hardware is scalar; even the “built-in” float4 types are just implemented as structs (see vector_types.h and vector_functions.h in the include folder to get some ideas…).
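To make that concrete, here is a minimal sketch of such a class: a column-major 4x4 matrix (matching OpenGL’s memory layout) with a matrix product and a vector transform. The HD macro is my own convention so the same code compiles under both nvcc (where the qualifiers make it usable in kernels, and you would use the real float4) and a plain host compiler; vec4 here is just a host-side stand-in for CUDA’s float4:

```cpp
// Expands to the CUDA function qualifiers under nvcc, to nothing otherwise.
#if defined(__CUDACC__)
#define HD __host__ __device__
#else
#define HD
#endif

// Column-major 4x4 matrix, element (row, col) stored at m[col * 4 + row],
// matching OpenGL's layout.
struct mat4 {
    float m[16];
};

// Host-side stand-in for CUDA's float4.
struct vec4 {
    float x, y, z, w;
};

HD mat4 mat4_identity() {
    mat4 r = {};
    for (int i = 0; i < 4; ++i) r.m[i * 4 + i] = 1.0f;
    return r;
}

// C = A * B, the standard graphics-style matrix product.
HD mat4 mat4_mul(const mat4 &a, const mat4 &b) {
    mat4 r = {};
    for (int c = 0; c < 4; ++c)
        for (int row = 0; row < 4; ++row) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += a.m[k * 4 + row] * b.m[c * 4 + k];
            r.m[c * 4 + row] = s;
        }
    return r;
}

// r = A * v (column vector on the right).
HD vec4 mat4_mul_vec4(const mat4 &a, const vec4 &v) {
    vec4 r;
    r.x = a.m[0] * v.x + a.m[4] * v.y + a.m[8]  * v.z + a.m[12] * v.w;
    r.y = a.m[1] * v.x + a.m[5] * v.y + a.m[9]  * v.z + a.m[13] * v.w;
    r.z = a.m[2] * v.x + a.m[6] * v.y + a.m[10] * v.z + a.m[14] * v.w;
    r.w = a.m[3] * v.x + a.m[7] * v.y + a.m[11] * v.z + a.m[15] * v.w;
    return r;
}
```

Since every thread typically works on its own matrices, plain scalar loops like these are usually fine; the compiler unrolls them, and there is no dedicated matrix hardware to target anyway.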

So as to stay on topic: does anybody have any clarifications for my OpenGL-related questions outlined above?