Fastest way to copy OpenGL Texture to CPU memory

Hello,

Could anyone confirm that using

cudaHostAlloc(pointer, numBytes, cudaHostAllocWriteCombined)
cudaGraphicsGLRegisterImage( READ_ONLY..)
cudaGraphicsResourceSetMapFlags( READ_ONLY..)
cudaGraphicsMapResources
cudaGraphicsSubResourceGetMappedArray
cudaMemcpy2DFromArray

provides the fastest bandwidth when one wants to copy texture data from GPU memory to CPU memory?
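
In code, the sequence I'm using looks roughly like this (just a sketch with error checking omitted; texName, width and height stand in for my actual texture and its dimensions, assumed RGBA8):

#include <GL/gl.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Done once at init:
void* hostBuf = nullptr;
size_t pitch = (size_t)width * 4;   // 4 bytes per RGBA8 texel
cudaHostAlloc(&hostBuf, pitch * height, cudaHostAllocWriteCombined);
cudaGraphicsResource* res = nullptr;
cudaGraphicsGLRegisterImage(&res, texName, GL_TEXTURE_2D, cudaGraphicsRegisterFlagsReadOnly);
cudaGraphicsResourceSetMapFlags(res, cudaGraphicsMapFlagsReadOnly);

// Per frame:
cudaGraphicsMapResources(1, &res, 0);
cudaArray_t texArray = nullptr;
cudaGraphicsSubResourceGetMappedArray(&texArray, res, 0, 0);
cudaMemcpy2DFromArray(hostBuf, pitch, texArray, 0, 0, pitch, height, cudaMemcpyDeviceToHost);
cudaGraphicsUnmapResources(1, &res, 0);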

I achieve 2.6 GiB/s with a GTX 1080 (eGPU over Thunderbolt 3, PCIe 3.0 x4).

Thanks!

Also, if this is indeed the fastest way, could someone explain what the limiting factor is that prevents the ~4 GB/s theoretical maximum of a Thunderbolt 3 PCIe 3.0 x4 link from being reached?

[1] It is not clear what you measure, how you measure it, or for what size of transfer. Because transfers have fixed overheads, transfer sizes must be in the MBs to achieve peak throughput. To measure transfer speeds, I suggest using a framework like this:

cudaDeviceSynchronize()
start high-resolution timer
cudaMemcpy…()
cudaDeviceSynchronize()
stop high-resolution timer

Calibrate this by measuring the overhead (use the same sequence but without the cudaMemcpy). Repeat measurements 10 times. Report the smallest net time (measured time - measured overhead). The high resolution timer should have microsecond granularity or better.
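
For concreteness, a minimal version of that framework (assuming a std::chrono-based timer, a pinned host buffer dst and a device buffer src of numBytes bytes) could look like this:

#include <cuda_runtime.h>
#include <chrono>

// Returns the time in milliseconds for one device-to-host copy of numBytes bytes.
double measureCopyMs(void* dst, const void* src, size_t numBytes)
{
    cudaDeviceSynchronize();                            // drain prior GPU work
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMemcpy(dst, src, numBytes, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();                            // wait for the copy to finish
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Measure the overhead with the same sequence minus the cudaMemcpy, repeat
// everything ~10 times, and report numBytes / (smallest copy time - smallest overhead).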

[2] A PCIe gen 3 x4 link maxes out at around 3.1 to 3.2 GB/sec in practice, for the same reason a x16 link maxes out at about 12 to 12.5 GB/sec. PCIe is a packetized interconnect that achieves theoretical transfer rates only at infinite packet size. In actual hardware, the packet size is limited to something like 128 or 256 bytes, so only around 80% of theoretical bandwidth is achievable. 1 GB/s = 1e9 bytes/second.
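
As a rough back-of-the-envelope calculation (assuming a 256-byte maximum payload and a couple of dozen bytes of per-packet header overhead): PCIe gen 3 signals at 8 GT/s per lane with 128b/130b encoding, so a x4 link carries about 4 × 8 GT/s × 128/130 ≈ 31.5 Gbit/s ≈ 3.94 GB/s of raw data; at roughly 80% packet efficiency that works out to the 3.1 to 3.2 GB/s quoted above.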

[3] If host/device transfers are not using asynchronous copies using pinned system memory, there is additional overhead for a system memory copy from/to a pinned buffer maintained by the CUDA driver.

The fastest way to copy an OpenGL texture to CPU memory is almost certainly via an OpenGL API, not via CUDA/OpenGL interop.
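
One way to do that (just a sketch, not necessarily the only or best approach; texName, width and height are placeholders and the texture is assumed to be RGBA8) is an asynchronous readback through a pixel pack buffer, so the GPU-to-PBO copy overlaps with rendering and the CPU maps the buffer only once a fence signals:

GLuint pbo = 0;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, (GLsizeiptr)width * height * 4, nullptr, GL_STREAM_READ);

// Kick off the GPU -> PBO copy; this returns immediately because a pack buffer is bound:
glBindTexture(GL_TEXTURE_2D, texName);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// ... render one or more further frames ...

// Once the fence has signaled, map the PBO and hand the pixels to the disk writer/encoder:
glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000ull);   // wait up to 1 s
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
void* pixels = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, (GLsizeiptr)width * height * 4, GL_MAP_READ_BIT);
// ... write 'pixels' out ...
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glDeleteSync(fence);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

In practice one would cycle two or three such PBOs so that mapping never waits on the frame that was just rendered.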

Thanks for all your help.

I realize I am not specific enough.

My end goal is to record OpenGL textures to disk at 60 Hz in 4K (3840 × 2160), RGBA (alpha is important in my case), 4 bytes per pixel. This is about 1.8 GiB/s of data.
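
Spelled out: 3840 × 2160 pixels × 4 bytes/pixel × 60 frames/s ≈ 1.99e9 bytes/s, i.e. roughly 1.85 GiB/s.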

This is for a pet project which does musical visualization.

I have to do this “live”; I cannot reduce the frame rate and fix up the recording afterwards.

I have not been successful at copying from GPU to CPU memory at 1.8 GiB/s.

Here were my attempts:

I am missing about 300 MiB/s to be able to record live at this resolution without resorting to complex workarounds such as ring-buffering GPU memory and/or slowing down the frame rate and then adjusting the recording.

The figure I'm quoting (2.6 GiB/s) is from the CUDA-Z benchmark tool (http://cuda-z.sourceforge.net/), which uses the same method I described. Mentioning that figure without context was a bit silly, since I'm not actually achieving 2.6 GiB/s myself; otherwise I'd be happy.

I am achieving neither the 2.6 GiB/s nor the 3.2 GiB/s (@njuffa); I'm guessing for two reasons:

  1. I have two OpenGL apps running at the same time. They don't transfer anything from CPU to GPU, but they use about 40% of the GPU's processing power, so I'm guessing this impacts copy performance as well; could someone confirm that? I can reproduce this behavior by opening my two OpenGL apps alongside CUDA-Z: CUDA-Z then reports lower bandwidth.
  2. I may not be using the most optimal way to copy from GPU to CPU yet. I am hoping that the CUDA-Z benchmark tool is not using the most optimal copy method either, and that a better one could reach 3.2 GiB/s.

[2] A PCIe gen 3 x4 link maxes out at around 3.1 to 3.2 GB/sec in practice

This gives me hope I can achieve my end goal.

[3] If host/device transfers are not using asynchronous copies using pinned system memory, there is additional overhead for a system memory copy from/to a pinned buffer maintained by the CUDA driver.

Are you implying that an async copy would be more performant? Right now I am doing a synchronous copy with page-locked memory. If so, I should try that.
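
If I understand correctly, the async variant would look roughly like this (a sketch based on the mapped array and pinned buffer from my first post; error checks omitted):

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaGraphicsMapResources(1, &res, stream);
cudaGraphicsSubResourceGetMappedArray(&texArray, res, 0, 0);
cudaMemcpy2DFromArrayAsync(hostBuf, pitch, texArray, 0, 0, pitch, height, cudaMemcpyDeviceToHost, stream);
cudaGraphicsUnmapResources(1, &res, stream);

// ... CPU can do other work here (e.g. write the previous frame to disk) ...
cudaStreamSynchronize(stream);   // make sure the copy has landed before reading hostBuf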

The fastest way to copy an OpenGL texture to CPU memory is almost certainly via an OpenGL API, not via CUDA/OpenGL interop.

Right now, the fastest solution I could find uses CUDA. But maybe that's because I did not try using page-locked memory with the OpenGL APIs. I would like to try that, but with Java on Windows I could not find a way to allocate page-locked memory other than through the CUDA allocation functions.

Thanks for any help on this problem!

Cheers

Note that I wrote 3.1 to 3.2 GB/sec as the practical speed-of-light transfer rate for an x4 link. It is customary to state throughput in standard physical units, so ‘G’ means 1e9. Capacities, on the other hand, are stated in mebibytes, gibibytes, tebibytes.

2.6 GiB/sec ≈ 2.8 GB/sec, so this is already quite close to the practically achievable limit. In my experience, CUDA-Z underreports the achievable device/host transfer rates by about 3%. While the GPU can overlap DMA data transfers with CUDA kernel activity, concurrent OpenGL activity can interfere with the copy performance because OpenGL makes use of the PCIe link (at minimum, for sending commands down to the GPU) and also makes use of the GPU memory. Your experiment in this regard is appropriate and telling.

My advice would be to use more capable hardware, in particular a faster link than PCIe gen 3 x4, and probably a faster GPU overall as well. I approach issues from an engineering perspective. That includes providing sufficient margin in a design for when things go less than perfectly.

Definitely upgrade the hardware ;) CUDA-Z is reporting 12.1 GiB/s on my RTX 2080 Ti. lol