OpenGL interop performance

There seems to be a lot of unclear information regarding how to efficiently do CUDA/OpenGL interop for textures/surfaces. There are unanswered questions on this forum, and unclear or outright wrong answers elsewhere on this issue.

Right now I have something like:

cudaArray* arrayPtr;

// Map the registered GL texture and get the backing array for mip 0 / layer 0
CHECK_CUDA(cudaGraphicsMapResources(1, &cudaRegisteredTexture, 0));
CHECK_CUDA(cudaGraphicsSubResourceGetMappedArray(&arrayPtr, cudaRegisteredTexture, 0, 0));

// Wrap the array in a surface object so the kernel can write to it
cudaSurfaceObject_t surface;
cudaResourceDesc surfaceDetails{};

surfaceDetails.resType = cudaResourceType::cudaResourceTypeArray;
surfaceDetails.res.array.array = arrayPtr;

CHECK_CUDA(cudaCreateSurfaceObject(&surface, &surfaceDetails));

RenderOptixRaytracingAndCopyToSurface(surface);

// Tear everything down again before handing the texture back to GL
CHECK_CUDA(cudaDestroySurfaceObject(surface));
CHECK_CUDA(cudaGraphicsUnmapResources(1, &cudaRegisteredTexture, 0));

This has noticeable performance hiccups when run every frame - more than you would expect. Per the documentation, cudaGraphicsSubResourceGetMappedArray() is not guaranteed to return the same cudaArray* on every map, and even if it did, it's not clear whether it would be safe to cache the surface object (recreating it only when the array pointer changes).
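For what it's worth, the caching I have in mind looks roughly like this. This is only a sketch (the RenderFrame wrapper and the cached globals are my own names, and whether a surface object stays valid across map/unmap cycles is exactly the open question):

```cpp
// Hypothetical caching scheme: rebuild the surface object only when the
// mapped array pointer actually changes between frames. NOT verified safe.
static cudaArray* cachedArray = nullptr;
static cudaSurfaceObject_t cachedSurface = 0;

void RenderFrame()
{
    CHECK_CUDA(cudaGraphicsMapResources(1, &cudaRegisteredTexture, 0));

    cudaArray* arrayPtr = nullptr;
    CHECK_CUDA(cudaGraphicsSubResourceGetMappedArray(&arrayPtr, cudaRegisteredTexture, 0, 0));

    // Recreate the surface object only if the backing array moved.
    if (arrayPtr != cachedArray)
    {
        if (cachedSurface)
            CHECK_CUDA(cudaDestroySurfaceObject(cachedSurface));

        cudaResourceDesc desc{};
        desc.resType = cudaResourceTypeArray;
        desc.res.array.array = arrayPtr;
        CHECK_CUDA(cudaCreateSurfaceObject(&cachedSurface, &desc));
        cachedArray = arrayPtr;
    }

    RenderOptixRaytracingAndCopyToSurface(cachedSurface);

    CHECK_CUDA(cudaGraphicsUnmapResources(1, &cudaRegisteredTexture, 0));
}
```

If the array pointer is stable in practice, this drops the per-frame create/destroy pair, but I can't find anything in the docs that promises the cached surface object remains valid while the resource is unmapped.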

If the map/unmap itself is what stalls (e.g. CUDA waiting on GL to finish with the texture), it might make sense to cycle between multiple registered textures, but this isn't mentioned anywhere.
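Something along these lines is what I mean - a sketch only, assuming kNumBuffers GL textures were created and registered up front with cudaGraphicsGLRegisterImage(), and with all names other than the CUDA API calls being hypothetical:

```cpp
// Round-robin over several registered GL textures so CUDA writes one
// while GL can still read another, hopefully avoiding a sync stall.
constexpr int kNumBuffers = 3;
static cudaGraphicsResource_t registeredTextures[kNumBuffers]; // registered at init
static int frameIndex = 0;

void RenderFrame()
{
    cudaGraphicsResource_t& res = registeredTextures[frameIndex];
    frameIndex = (frameIndex + 1) % kNumBuffers;

    CHECK_CUDA(cudaGraphicsMapResources(1, &res, 0));

    cudaArray* arrayPtr = nullptr;
    CHECK_CUDA(cudaGraphicsSubResourceGetMappedArray(&arrayPtr, res, 0, 0));

    cudaResourceDesc desc{};
    desc.resType = cudaResourceTypeArray;
    desc.res.array.array = arrayPtr;

    cudaSurfaceObject_t surface;
    CHECK_CUDA(cudaCreateSurfaceObject(&surface, &desc));

    RenderOptixRaytracingAndCopyToSurface(surface);

    CHECK_CUDA(cudaDestroySurfaceObject(surface));
    CHECK_CUDA(cudaGraphicsUnmapResources(1, &res, 0));

    // The GL side would then display the most recently unmapped texture
    // rather than the one CUDA is about to write next frame.
}
```

Whether this actually helps presumably depends on how the driver synchronises at map time, which is the part the documentation doesn't spell out.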

I’d respectfully suggest the CUDA documentation needs more information on best practices here.

After much digging I came across this old document:

I think some of that information should be summarised in the main CUDA documentation.

You’ve perhaps already seen this; it’s a little more recent: