Using CUDA to perform OpenGL texture readback efficiently

I’m a complete CUDA newbie and have been working on a CUDA/C++ plugin for Unity, meant to speed up GPU readback of rendered frames and make the data accessible to a separate process.
The plugin works and does run faster than Unity’s readback methods, but I’m wondering if this is as good as it gets.

The current process is as follows:

Initialization goes like this:

  1. An OpenGL texture handle is passed from the engine to my plugin.
  2. The handle is registered using cudaGraphicsGLRegisterImage (specifying the read-only flag).
  3. I initialize a shared memory segment on the host (for IPC), aligned to 4096 bytes, and page-lock the memory using cudaHostRegister.
  4. I save a mapped device pointer to the above using cudaHostGetDevicePointer.
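To make the setup concrete, here is roughly what my initialization code looks like. It's a simplified sketch: the `g_*` names, `InitReadback` and `segmentBytes` are placeholders, and error handling is collapsed into a macro.

```cuda
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

#define CHECK(call) do { cudaError_t e = (call); \
    if (e != cudaSuccess) return e; } while (0)

static cudaGraphicsResource_t g_resource  = nullptr; // registered GL texture
static void*                  g_sharedMem = nullptr; // 4096-aligned IPC segment (host)
static void*                  g_devPtr    = nullptr; // device alias of g_sharedMem

// sharedMem must already point at the 4096-aligned shared-memory segment.
cudaError_t InitReadback(GLuint glTexture, void* sharedMem, size_t segmentBytes)
{
    // Step 2: register the GL texture handle for read-only CUDA access.
    CHECK(cudaGraphicsGLRegisterImage(&g_resource, glTexture, GL_TEXTURE_2D,
                                      cudaGraphicsRegisterFlagsReadOnly));

    // Step 3: page-lock the segment and make it device-addressable.
    g_sharedMem = sharedMem;
    CHECK(cudaHostRegister(g_sharedMem, segmentBytes, cudaHostRegisterMapped));

    // Step 4: fetch the device-side pointer aliasing the host segment.
    CHECK(cudaHostGetDevicePointer(&g_devPtr, g_sharedMem, 0));
    return cudaSuccess;
}
```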

Then for every frame we have:

  1. cudaGraphicsMapResources on the resource registered in initialization step 2.
  2. cudaGraphicsSubResourceGetMappedArray to get the cudaArray backing the resource.
  3. cudaMemcpy2DFromArray from that array to the device pointer obtained in initialization step 4 (with cudaMemcpyDeviceToDevice).
  4. cudaGraphicsUnmapResources.
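The per-frame path, in the same simplified sketch form (the names are placeholders; `rowBytes` is the frame width in bytes, which also serves as the destination pitch since the shared segment is a linear buffer):

```cuda
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

#define CHECK(call) do { cudaError_t e = (call); \
    if (e != cudaSuccess) return e; } while (0)

// Placeholders: the resource registered in initialization step 2 and the
// mapped device pointer from initialization step 4.
extern cudaGraphicsResource_t g_resource;
extern void*                  g_devPtr;

cudaError_t CopyFrame(size_t rowBytes, size_t height)
{
    CHECK(cudaGraphicsMapResources(1, &g_resource, 0));

    cudaArray_t array = nullptr;
    CHECK(cudaGraphicsSubResourceGetMappedArray(&array, g_resource, 0, 0));

    // Device-to-device copy from the texture's backing array into the
    // device alias of the host shared-memory segment.
    CHECK(cudaMemcpy2DFromArray(g_devPtr, rowBytes, array, 0, 0,
                                rowBytes, height, cudaMemcpyDeviceToDevice));

    CHECK(cudaGraphicsUnmapResources(1, &g_resource, 0));
    return cudaSuccess;
}
```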

That’s it; I don’t need to do any computation, just get the data back from the GPU. As mentioned, this works and is 2-3 times faster than Unity’s methods, but it’s still a bit slow for our use case.

I have the following questions:

  1. Is the above the correct method for high-performance copying? I’ve read arguments both for using pinned memory without mapping and for this (zero-copy?) method. Is one of them superior to the other for this use case?

  2. Would I benefit from parallelizing the copy? The images I’m working with are unlikely to be huge (I’m currently benchmarking with frames of around 8 MB).

  3. If the answer to 2 is yes: is it possible to do this with the method I’ve implemented? I’ve been looking at cudaMemcpy2DFromArrayAsync, but as far as I can see the resource is mapped on a single stream, so I’m not sure how I would go about splitting the work…

  4. Can I realistically expect to squeeze out much better performance by optimizing this process? I’m a newbie to CUDA, but from my understanding it really shines when running multiple computation-heavy kernels.
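To make questions 1 and 3 more concrete, here is what I imagine the alternative would look like: a device-to-host copy straight into the pinned segment (skipping cudaHostGetDevicePointer entirely) with everything queued on a stream via cudaMemcpy2DFromArrayAsync. This is only a sketch with placeholder names, and I’m not sure this is how streams are meant to be used here:

```cuda
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

#define CHECK(call) do { cudaError_t e = (call); \
    if (e != cudaSuccess) return e; } while (0)

// Placeholders: the resource registered at init and the page-locked host
// segment (no mapped device pointer needed in this variant).
extern cudaGraphicsResource_t g_resource;
extern void*                  g_sharedMem;

cudaError_t CopyFrameAsync(size_t rowBytes, size_t height, cudaStream_t stream)
{
    // Map, copy and unmap are all queued on the same stream.
    CHECK(cudaGraphicsMapResources(1, &g_resource, stream));

    cudaArray_t array = nullptr;
    CHECK(cudaGraphicsSubResourceGetMappedArray(&array, g_resource, 0, 0));

    // Device-to-host copy directly into the pinned shared segment.
    CHECK(cudaMemcpy2DFromArrayAsync(g_sharedMem, rowBytes, array, 0, 0,
                                     rowBytes, height,
                                     cudaMemcpyDeviceToHost, stream));

    CHECK(cudaGraphicsUnmapResources(1, &g_resource, stream));

    // The host only waits here, when the frame must actually be visible
    // to the consumer process.
    return cudaStreamSynchronize(stream);
}
```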

Thanks for reading! Would appreciate any input on the matter, it’s been an exciting journey so far, let’s hope it continues in the same vein.