Inefficient CUDA and OpenGL Interop

agras · December 4, 2012, 4:44pm

All,

I’m attempting to improve the efficiency of CUDA->OpenGL interop in my software. Currently I am running a series of CUDA Kernels and then using the map resources, get mapped pointer, transfer, unmap sequence. As a note, unfortunately I must use a memcpyasync in between the map/get mapped pointer and unmap sections, instead of running a kernel in between. However the memcpyasync only contributes 1-2 ms so it is not my primary worry right now. The large amount of time required by the map, get pointer, unmap sequence though is confusing me. I would think this would be low overhead. See below:

... CUDA Kernels executed here ....

// Map the CudaPboResource to the device
cudaGraphicsMapResources( 1, &imageCudaPBO, stream[0] );
cudaGraphicsResourceGetMappedPointer( (void **) &PBOptr, &byteCounter, imageCudaPBO );

// Copy image from CUDA to OpenGL
cudaMemcpyAsync( (RGBPixel*) PBOptr,( RGBPixel*) pImage, GetFrameSize(), cudaMemcpyDeviceToDevice, stream[0] );

// Unmap buffer object
cudaGraphicsUnmapResources( 1, &imageCudaPBO, stream[0] );

... OpenGL Rendering Done here ....

To map the resource back to OpenGL. The issue is that the above code is taking anywhere from 2-8 ms to complete. I’m attempting to run this code at 60 frames per second ( ~16.66 ms frame times), so this eats up a lot of rendering time from my OpenGL pipeline. Any idea how this could be improved or sped up? Am I missing something here that is fundamental to CUDA/GL interop?

Thanks for your help.

njuffa · December 4, 2012, 6:31pm

I am not personally familiar with CUDA/OpenGL interop, but it seems to me that since mapping and unmapping are expensive operations (per your measurements) you would want to map once at the start of the app, unmap at the end of the app, and in between just continue to re-use existing mappings. Is that not possible?

tstich · December 5, 2012, 8:53am

Using map/unmap in the rendering loop is the correct way to implement CUDA/GL interop. Please note that the map/unmap calls serve as synchronization points between the two APIs. So when measuring their time you might actually end up measuring the time of previous asynchronous CUDA calls too (e.g. kernels, async memcpys). If you want to time the overhead for the interop you should insert cudaDeviceSynchronize calls to ensure there is no outstanding calls before the calls.

On a side note, OpenGL 4.3 introduces compute shaders which allow for getting rid of the interop overhead completely.

Jimmy_Pettersson · December 5, 2012, 1:01pm

I use this very simply code to update a cudaArray that is bound to a OpenGL texture

cudaGraphicsMapResources(1, &cuda_texture);
cudaArray* memDevice;
cudaGraphicsSubResourceGetMappedArray(&memDevice, cuda_texture, 0, 0);
		
		
cudaMemcpyToArray(memDevice, 0, 0, d_in, w*h*sizeof(uchar4), cudaMemcpyDeviceToDevice);
		
cudaGraphicsUnmapResources(1, &cuda_texture);

This operation takes less than 0.2 ms and is called each time in the OpenGL display function.

My entire “glutMainLoopEvent()” takes roughly 1 ms

agras · December 5, 2012, 4:20pm

Thank you everyone for your responses – I appreciate your help.

Jimmy – I will have to try that approach and see what the performance is like.

tstich – I also look forward to researching this – thanks so much for the suggestion.

Topic		Replies	Views
OpenGL interop performance issues again... (or rather, still...) CUDA Programming and Performance	7	2455	April 16, 2009
cudaGraphicsUnmapResources performance overhead CUDA Programming and Performance	0	178	April 18, 2024
OpenGL interop performance ... yes, STILL CUDA Programming and Performance	6	6477	March 29, 2010
OpenGL interop performance CUDA Programming and Performance	2	45	December 11, 2024
A problem of CUDA & OpenGL interoperation CUDA Programming and Performance	4	3951	May 17, 2009
CUDA-OpenGL interop performance CUDA Programming and Performance	2	2446	May 30, 2014
cudaGraphicsMapResources each frame or just once when cuda-opengl interop ? which better? CUDA Programming and Performance	7	398	December 6, 2023
OpenGL & CUDA interop with surfaces slow... CUDA Programming and Performance	2	956	July 6, 2018
cudaGLMapBufferObject (and unmap) performance These calls take way too long CUDA Programming and Performance	47	76293	February 14, 2010
OpenGL interop performance problems CUDA Programming and Performance	2	1322	February 2, 2010

Inefficient CUDA and OpenGL Interop

Related topics