Inefficient CUDA and OpenGL Interop


I’m attempting to improve the efficiency of CUDA->OpenGL interop in my software. Currently I run a series of CUDA kernels and then perform the map resources, get mapped pointer, transfer, unmap sequence. As a note, unfortunately I must use a `cudaMemcpyAsync` between the map/get-mapped-pointer calls and the unmap call, instead of running a kernel in between. However, the `cudaMemcpyAsync` only contributes 1-2 ms, so it is not my primary worry right now. What confuses me is the large amount of time taken by the map, get pointer, unmap sequence itself; I would have expected it to be low overhead. See below:

... CUDA Kernels executed here ....

// Map the CudaPboResource to the device
cudaGraphicsMapResources( 1, &imageCudaPBO, stream[0] );
cudaGraphicsResourceGetMappedPointer( (void **) &PBOptr, &byteCounter, imageCudaPBO );

// Copy image from CUDA to OpenGL
cudaMemcpyAsync( (RGBPixel*) PBOptr, (RGBPixel*) pImage, GetFrameSize(), cudaMemcpyDeviceToDevice, stream[0] );

// Unmap buffer object
cudaGraphicsUnmapResources( 1, &imageCudaPBO, stream[0] );

... OpenGL Rendering Done here ....

The unmap call returns the resource to OpenGL. The issue is that the above code takes anywhere from 2-8 ms to complete. I’m attempting to run this code at 60 frames per second (~16.66 ms frame times), so this eats up a lot of rendering time from my OpenGL pipeline. Any idea how this could be improved or sped up? Am I missing something here that is fundamental to CUDA/GL interop?

Thanks for your help.

I am not personally familiar with CUDA/OpenGL interop, but it seems to me that since mapping and unmapping are expensive operations (per your measurements) you would want to map once at the start of the app, unmap at the end of the app, and in between just continue to re-use existing mappings. Is that not possible?

Using map/unmap in the rendering loop is the correct way to implement CUDA/GL interop. Please note that the map/unmap calls serve as synchronization points between the two APIs. So when measuring their time you might actually end up measuring the time of previous asynchronous CUDA calls too (e.g. kernels, async memcpys). If you want to time the interop overhead itself, you should insert a cudaDeviceSynchronize call to ensure there are no outstanding operations before the interop calls.
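To illustrate, a minimal timing sketch along these lines (reusing the names from the original post; the resource and stream are assumed to exist already) might look like:

```cpp
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>
#include <cstdio>

// Hedged sketch: isolate the map/getPointer/unmap cost from earlier
// asynchronous work by draining the device first, then timing only
// the interop calls with CUDA events.
void timeInteropOverhead(cudaGraphicsResource* imageCudaPBO, cudaStream_t stream)
{
    // Drain all previously issued kernels/copies so the events below
    // bracket only the interop calls themselves.
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);

    void*  PBOptr      = nullptr;
    size_t byteCounter = 0;
    cudaGraphicsMapResources(1, &imageCudaPBO, stream);
    cudaGraphicsResourceGetMappedPointer(&PBOptr, &byteCounter, imageCudaPBO);
    cudaGraphicsUnmapResources(1, &imageCudaPBO, stream);

    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("map/getPointer/unmap overhead: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```

This requires a live GL context and a registered resource, so it only runs inside the rendering loop; the key point is the cudaDeviceSynchronize before the first event record.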

On a side note, OpenGL 4.3 introduces compute shaders which allow for getting rid of the interop overhead completely.

I use this very simple code to update a cudaArray that is bound to an OpenGL texture:

cudaGraphicsMapResources(1, &cuda_texture);
cudaArray* memDevice;
cudaGraphicsSubResourceGetMappedArray(&memDevice, cuda_texture, 0, 0);
cudaMemcpyToArray(memDevice, 0, 0, d_in, w*h*sizeof(uchar4), cudaMemcpyDeviceToDevice);
cudaGraphicsUnmapResources(1, &cuda_texture);

This operation takes less than 0.2 ms and is called each time in the OpenGL display function.
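The snippet above assumes cuda_texture was registered beforehand. A minimal sketch of that one-time setup, assuming an existing 2D GL texture whose id is texId (a placeholder name), might look like:

```cpp
#include <GL/gl.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Hedged sketch: register the GL texture with CUDA once at startup,
// then map/unmap it each frame as shown in the snippet above.
// texId is a placeholder for an already-created GL_TEXTURE_2D.
cudaGraphicsResource* registerTexture(GLuint texId)
{
    cudaGraphicsResource* cuda_texture = nullptr;
    // WriteDiscard tells CUDA the previous contents need not be
    // preserved, which can avoid a copy on map.
    cudaGraphicsGLRegisterImage(&cuda_texture, texId, GL_TEXTURE_2D,
                                cudaGraphicsRegisterFlagsWriteDiscard);
    return cuda_texture;
}
```

Registration is comparatively expensive, so it belongs in initialization, not in the per-frame loop; only the map/copy/unmap sequence runs each frame.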

My entire “glutMainLoopEvent()” takes roughly 1 ms

Thank you everyone for your responses – I appreciate your help.

Jimmy – I will have to try that approach and see what the performance is like.

tstich – I also look forward to researching this – thanks so much for the suggestion.