CUDA and OpenGL interop: ideas and questions for a deeper understanding

I am working on an application that mixes OpenGL and CUDA to do processing of the data in a game renderer.

I would like to get an idea of the internal mechanisms behind the interop, so that I can assess the performance of different approaches without having to benchmark an exponential set of combinations or just imitate the CUDA samples.
I’m simplifying things a bit here, and I assume that engine/driver syncs are not an issue (i.e. Bindless Graphics & Co.).

  1. OpenGL seems to have less interoperability than DirectX, but this is just an illusion, since DirectX simply copies things to internal buffers while with OpenGL you’re in control of everything. Is this correct?

  2. Do the drivers have a single shared command buffer that both CUDA and GL append to? So while CUDA calls mixed with GL calls may be expensive (as if they performed a bunch of shader and render-target changes, …), they do not synchronize with the CPU?

  3. I have heard about driver “context switches” when switching APIs. Do all calls cause these, or just a few? (Maybe something that only sets deferred state, such as glClearColor, doesn’t cause one.) Do only kernel launches and SwapBuffers trigger them, or only when command buffers are full? How does this work?

3bis) Is cudaGLMapBufferObject/cudaGLUnmapBufferObject a CUDA or an OpenGL context call? I.e., do I mix it with my OpenGL calls and then later execute a bunch of CUDA calls on the mapped buffers, or do I mix it with the CUDA calls?

  4. cudaGLMapBufferObject returns a pointer. Is this just a simple lookup in the driver’s internal “globjects[id]->device_ptr” table, or does it actually force a GPU-to-CPU flush and sync before returning the address?
    Since it just returns a pointer to the data and not the actual data, it could in theory return the pointer before the GL calls writing to the buffer have completed. You could then queue kernel calls which will eventually use the data… and which we eventually (maybe next frame) read back, nice and fast ^_^

  5. If cudaGLMapBufferObject does wait for the data to be available, is there any benefit to PBO-style “double buffering”, i.e. rendering to another buffer than the one I read from and swapping? Or does a context switch force all GL calls to finish anyway?

  6. Ok, pushing my luck here: is there any chance I could do a few GetBufferParameterui64vNV(BUFFER_GPU_ADDRESS) + MakeBufferResidentNV calls once and pass the addresses to CUDA, or is this basically what cudaGLMapBufferObject does anyway? (It seems not, since cudaGLMapBufferObject allows write access.)
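For reference, newer CUDA releases supersede the cudaGLMapBufferObject family with the graphics-resource API, which at least makes the map/unmap ownership handoff explicit. A minimal sketch of the setup and per-frame pattern (error checking omitted; processKernel, grid and block are hypothetical placeholders, and this obviously needs a live GL context plus the CUDA toolkit to build):

```cpp
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// One-time setup: register an existing GL buffer object with CUDA.
cudaGraphicsResource *registerGLBuffer(GLuint vbo) {
    cudaGraphicsResource *res = nullptr;
    cudaGraphicsGLRegisterBuffer(&res, vbo, cudaGraphicsRegisterFlagsNone);
    return res;
}

// Per frame: issue all GL writes to the buffer first, then hand it to CUDA.
void runCudaPass(cudaGraphicsResource *res, cudaStream_t stream) {
    // Mapping transfers ownership to CUDA; the driver inserts whatever
    // synchronization the GL-to-CUDA handoff requires.
    cudaGraphicsMapResources(1, &res, stream);

    void  *devPtr = nullptr;
    size_t bytes  = 0;
    cudaGraphicsResourceGetMappedPointer(&devPtr, &bytes, res);

    // processKernel<<<grid, block, 0, stream>>>(devPtr, bytes);  // hypothetical

    // Unmapping returns ownership to GL before the next draw that reads it.
    cudaGraphicsUnmapResources(1, &res, stream);
}
```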

Maybe this could be something to add to the best practices guide or other docs. It seems to be a bit of a grey area…

While we are on the topic of OpenGL and CUDA interoperability, are there matrix operators in CUDA for the standard matrix sizes used in graphics (e.g. mat4*mat4, mat3*mat3, etc.)? Note that I don’t much care for a heavyweight parallel library such as CUBLAS for this; I’m wondering if there is a native CUDA operation for the standard graphics matrix/vector sizes.

Thanks!
-Anand

Looks like the native float3 and float4 types should work for vectors. What about matrices (like float3x3 and float4x4)?

You should just implement your own classes. The hardware is scalar; even the “built-in” float4 types are just implemented as structs (see vector_types.h and vector_functions.h in the include folder to get some ideas…).
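To make that concrete, here is a minimal sketch of such a class: a column-major 4x4 matrix (matching OpenGL’s memory layout) with a matrix product and a vector transform. The HD macro is my own convention so the same code compiles under both nvcc (where the qualifiers make it usable in kernels, and you would use the real float4) and a plain host compiler; vec4 here is just a host-side stand-in for CUDA’s float4:

```cpp
// Expands to the CUDA function qualifiers under nvcc, to nothing otherwise.
#if defined(__CUDACC__)
#define HD __host__ __device__
#else
#define HD
#endif

// Column-major 4x4 matrix, element (row, col) stored at m[col * 4 + row],
// matching OpenGL's layout.
struct mat4 {
    float m[16];
};

// Host-side stand-in for CUDA's float4.
struct vec4 {
    float x, y, z, w;
};

HD mat4 mat4_identity() {
    mat4 r = {};
    for (int i = 0; i < 4; ++i) r.m[i * 4 + i] = 1.0f;
    return r;
}

// C = A * B, the standard graphics-style matrix product.
HD mat4 mat4_mul(const mat4 &a, const mat4 &b) {
    mat4 r = {};
    for (int c = 0; c < 4; ++c)
        for (int row = 0; row < 4; ++row) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += a.m[k * 4 + row] * b.m[c * 4 + k];
            r.m[c * 4 + row] = s;
        }
    return r;
}

// r = A * v (column vector on the right).
HD vec4 mat4_mul_vec4(const mat4 &a, const vec4 &v) {
    vec4 r;
    r.x = a.m[0] * v.x + a.m[4] * v.y + a.m[8]  * v.z + a.m[12] * v.w;
    r.y = a.m[1] * v.x + a.m[5] * v.y + a.m[9]  * v.z + a.m[13] * v.w;
    r.z = a.m[2] * v.x + a.m[6] * v.y + a.m[10] * v.z + a.m[14] * v.w;
    r.w = a.m[3] * v.x + a.m[7] * v.y + a.m[11] * v.z + a.m[15] * v.w;
    return r;
}
```

Since every thread typically works on its own matrices, plain scalar loops like these are usually fine; the compiler unrolls them, and there is no dedicated matrix hardware to target anyway.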

So as to stay on topic: does anybody have any clarifications for my OpenGL-related questions outlined above?