I am working on an application that mixes OpenGL and CUDA to process data in a game renderer.
I would like to understand the internal mechanisms behind the interop. With a deeper understanding of the process, it should be easier to assess the performance of different approaches than by benchmarking an exponential set of combinations or imitating the CUDA samples.
I’m simplifying things a bit here and assuming that engine-driver syncs are not an issue (i.e. Bindless Graphics & Co.).
OpenGL seems to have less interoperability than DirectX, but this is just an illusion, since DirectX just copies things to buffers internally, whereas with OpenGL you’re in control of everything. Is this correct?
Do the drivers have a single shared command buffer that both CUDA and GL append to? So while CUDA calls mixed with GL calls may be expensive (as if they did a bunch of shader changes and render-target changes, …), they do not synchronize with the CPU?
I heard about driver “context switches” when switching APIs. Do all calls cause these, or just a few? Maybe a call that only sets deferred state (say glClearColor) doesn’t cause them. Maybe only kernel launches and buffer swaps cause them, or they only happen when the command buffers are full. How does this work?
Related to the previous question: are cudaGLMapBufferObject/cudaGLUnmapBufferObject CUDA context calls or OpenGL context calls? I.e. do I mix them in with my OpenGL calls and later execute a bunch of CUDA calls on the mapped buffers, or do I mix them in with the CUDA calls?
cudaGLMapBufferObject returns a pointer. Is this just a simple lookup in the driver’s internal “globjects[id]->device_ptr”, or does it actually force a GPU-to-CPU flush and sync before it returns the address?
Since it just returns a pointer to the data and not the actual data, it could in theory return the pointer before the GL calls writing to the buffer have completed. You could then queue kernel calls which will eventually use the data, and which we eventually (maybe next frame) read back, nice and fast ^_^
If cudaGLMapBufferObject does wait for the data to be available, is there any benefit to PBO-style “double buffering”, i.e. rendering to a different buffer than the one I read from and then swapping? Or does a context switch force all GL calls to finish anyway?
OK, pushing my luck here: is there any chance I could do a few GetBufferParameterui64vNV(BUFFER_GPU_ADDRESS) + MakeBufferResidentNV calls once and pass the addresses to CUDA? Or is this basically what cudaGLMapBufferObject does anyway? (It seems not, since it allows write access.)
Maybe this could be something to add to the best practices guide or other docs. It seems to be a bit of a grey area…