Fastest way to copy OpenGL Texture to CPU memory

sempereademfr · November 15, 2019, 3:54pm

Hello,

Could anyone confirm that using

cudaHostAlloc(pointer, numBytes, cudaHostAllocWriteCombined)
cudaGraphicsGLRegisterImage( READ_ONLY..)
cudaGraphicsResourceSetMapFlags( READ_ONLY..)
cudaGraphicsMapResources
cudaGraphicsSubResourceGetMappedArray
cudaMemcpy2DFromArray

gives the highest bandwidth when copying texture data from GPU memory to CPU memory?

I achieve 2.6 GiB/s with a GTX 1080 (eGPU Thunderbolt 3 on x4 PCIe 3 lanes).

Thanks!

sempereademfr · November 15, 2019, 4:29pm

Also, if it is indeed the fastest way, could someone explain what limiting factor keeps the transfer from reaching the 4 GiB/s maximum bandwidth of Thunderbolt 3 (x4 PCIe 3)?

njuffa · November 15, 2019, 5:15pm

[1] It is not clear what you measure, how you measure it, or for what transfer size. Because transfers have fixed overheads, transfer sizes must be in the MBs to achieve peak throughput. To measure transfer speeds, I suggest using a framework like this:

cudaDeviceSynchronize()
start high-resolution timer
cudaMemcpy…()
cudaDeviceSynchronize()
stop high-resolution timer

Calibrate this by measuring the overhead (use the same sequence but without the cudaMemcpy). Repeat measurements 10 times. Report the smallest net time (measured time - measured overhead). The high resolution timer should have microsecond granularity or better.
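The procedure above can be sketched in host-side C++ roughly as follows. This is only an illustration under stated assumptions: the helper names (`min_seconds`, `net_seconds`, `gigabytes_per_second`, `host_memcpy_net_seconds`) are hypothetical, `std::chrono::steady_clock` is assumed to have microsecond granularity or better on the target platform, and a plain `std::memcpy` stands in for the device-to-host copy so the sketch runs without a GPU. It also subtracts the smallest overhead from the smallest measured time, which is one possible reading of "smallest net time".

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <functional>
#include <limits>
#include <vector>

// Smallest wall-clock time, in seconds, over `repeats` runs of `op`.
double min_seconds(const std::function<void()>& op, int repeats = 10) {
    assert(repeats > 0);
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < repeats; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        op();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Net time = smallest measured time minus smallest measured overhead.
// In a CUDA program, `with_copy` would run synchronize + copy + synchronize
// and `without_copy` the same sequence without the copy.
double net_seconds(const std::function<void()>& with_copy,
                   const std::function<void()>& without_copy,
                   int repeats = 10) {
    double measured = min_seconds(with_copy, repeats);
    double overhead = min_seconds(without_copy, repeats);
    return std::max(0.0, measured - overhead);  // clamp timer noise at zero
}

// Throughput in GB/s, with 1 GB = 1e9 bytes.
double gigabytes_per_second(std::size_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / seconds / 1e9;
}

// Stand-in (assumption): a host memcpy replaces the device-to-host copy.
double host_memcpy_net_seconds(std::size_t bytes) {
    std::vector<unsigned char> src(bytes, 1), dst(bytes, 0);
    auto with_copy = [&] { std::memcpy(dst.data(), src.data(), bytes); };
    auto without_copy = [] {};
    return net_seconds(with_copy, without_copy);
}
```

With a real GPU, the two lambdas would wrap the CUDA calls, and the result of `net_seconds` for a transfer of several MB would be passed to `gigabytes_per_second`.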

[2] A PCIe gen 3 x4 link maxes out at around 3.1 to 3.2 GB/sec in practice, for the same reason a x16 link maxes out at about 12 to 12.5 GB/sec. PCIe is a packetized interconnect that achieves theoretical transfer rates only at infinite packet size. In actual hardware, the packet size is limited to something like 128 or 256 bytes, so only around 80% of theoretical bandwidth is achievable. 1 GB/s = 1e9 bytes/second.
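As a rough check of these numbers, the arithmetic can be sketched as below. The function names are hypothetical. The 8 GT/s per-lane rate and 128b/130b encoding are the standard PCIe gen 3 figures, and the 80% efficiency is simply the estimate quoted above, not a measured value.

```cpp
#include <cassert>

// Raw PCIe gen 3 link bandwidth in GB/s (1 GB = 1e9 bytes).
// Gen 3 signals at 8 GT/s per lane with 128b/130b line encoding.
double pcie3_raw_gb_per_s(int lanes) {
    assert(lanes > 0);
    const double transfers_per_s = 8e9;     // 8 GT/s per lane
    const double encoding = 128.0 / 130.0;  // payload bits per line bit
    return lanes * transfers_per_s * encoding / 8.0 / 1e9;  // bits -> bytes
}

// Practical bandwidth under an assumed packet efficiency (default 80%).
double pcie3_practical_gb_per_s(int lanes, double efficiency = 0.80) {
    assert(efficiency > 0.0 && efficiency <= 1.0);
    return pcie3_raw_gb_per_s(lanes) * efficiency;
}
```

Under these assumptions a x4 link comes out at about 3.94 GB/s raw and about 3.15 GB/s in practice, and a x16 link at roughly 12.6 GB/s in practice, which is in line with the ranges given above.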

[3] If host/device transfers are not using asynchronous copies using pinned system memory, there is additional overhead for a system memory copy from/to a pinned buffer maintained by the CUDA driver.

Robert_Crovella · November 15, 2019, 5:39pm

The fastest way to copy an OpenGL texture to CPU memory is almost certainly via an OpenGL API, not via CUDA/OpenGL interop.

sempereademfr · November 16, 2019, 4:17pm

Thanks for all your help.

I realize I am not specific enough.

My end goal is to record OpenGL textures to disk at 60 Hz in 4K (3840 x 2160), RGBA (alpha is important in my case), 4 bytes per pixel. This is about 1.8 GiB/s of data.
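For reference, the required data rate can be worked out as in the small sketch below. The function names are hypothetical; the resolution, pixel size and frame rate are the ones stated in this post.

```cpp
#include <cassert>
#include <cstdint>

// Bytes in one uncompressed frame.
constexpr std::uint64_t frame_bytes(std::uint64_t width, std::uint64_t height,
                                    std::uint64_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

// Bytes per second needed to stream frames at the given rate.
constexpr std::uint64_t stream_bytes_per_second(std::uint64_t width,
                                                std::uint64_t height,
                                                std::uint64_t bytes_per_pixel,
                                                std::uint64_t frames_per_second) {
    return frame_bytes(width, height, bytes_per_pixel) * frames_per_second;
}

// Convert a byte count to GiB (1 GiB = 2^30 bytes).
constexpr double to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / 1073741824.0;
}
```

For 3840 x 2160 RGBA at 60 Hz this gives 33,177,600 bytes per frame and 1,990,656,000 bytes per second, which is about 1.85 GiB/s (or about 1.99 GB/s with 1 GB = 1e9 bytes).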

This is for a pet project which does musical visualization.

I have to do this “live”: I cannot reduce the frame rate and adjust the recording afterwards.

I am not successful at copying from GPU to CPU memory at 1.8GiB/s.

Here were my attempts:

1. use glTexSubImage2D → 400 MiB/s
2. use a PBO with glTexSubImage2D & glMapBuffer, using an improved memcpy (https://github.com/leadedge/Spout2/blob/master/SpoutSDK/Source/SpoutCopy.cpp#L136) → 640 MiB/s
3. use CUDA as explained in my post (since the benchmark results looked promising), I allocate 1.8 GiB for each texture using cudaHostAlloc → I achieve 1.5 GiB/s

I am 300 MiB/s short of being able to record live at this resolution without resorting to complex workarounds such as ring-buffering in GPU memory and/or lowering the fps and then adjusting the recording.

The figure that I’m quoting (2.6 GiB/s) is from the cuda-z benchmark tool (http://cuda-z.sourceforge.net/), which uses the same method I described. Mentioning that figure without context is a bit silly, since I’m not actually achieving 2.6 GiB/s; otherwise I’d be happy.

I am achieving neither 2.6 GiB/s nor the 3.2 GiB/s (@njuffa), I’m guessing for 2 reasons:

1. I have 2 OpenGL apps running at the same time. They don’t transfer anything from CPU to GPU but use 40% of the GPU processing power, so I’m guessing this impacts copy performance as well. Could someone confirm that? I can replicate this behavior by opening my 2 OpenGL apps and cuda-z: cuda-z then reports lower bandwidth.
2. I may not be using the most optimal way yet to copy from GPU to CPU. I am hoping that the cuda-z benchmark tool is not using the most optimal way to copy data either, and that it could achieve 3.2 GiB/s.

njuffa wrote: “[2] A PCIe gen 3 x4 link maxes out at around 3.1 to 3.2 GB/sec in practice”

This gives me hope I can achieve my end goal.

njuffa wrote: “[3] If host/device transfers are not using asynchronous copies using pinned system memory, there is additional overhead for a system memory copy from/to a pinned buffer maintained by the CUDA driver.”

Are you implying that async copy would be more performant? Right now I am doing sync copy with locked memory. If yes I should try that.

Robert_Crovella wrote: “The fastest way to copy an OpenGL texture to CPU memory is almost certainly via an OpenGL API, not via CUDA/OpenGL interop.”

Right now, the fastest solution I could find was using CUDA. But maybe it’s because I did not try using locked memory with OpenGL APIs. I would like to try that but with Java on Windows I could not find a way to allocate locked memory other than using CUDA allocation methods.

Thanks for any help on this problem!

Cheers

njuffa · November 16, 2019, 6:40pm

Note that I wrote 3.1 to 3.2 GB/sec as practical speed-of-light transfer rate for a x4 link. It is customary to state throughput in standard physical units, so ‘G’ for 1e9. On the other hand, capacities are stated in Mebibyte, Gibibyte, Tebibyte.
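To make the unit distinction concrete, here is a minimal conversion sketch. The function names are hypothetical; the only facts used are 1 GiB = 2^30 bytes and 1 GB = 1e9 bytes.

```cpp
#include <cassert>

// 1 GiB = 2^30 bytes, 1 GB = 1e9 bytes, so 1 GiB = 1.073741824 GB.
constexpr double kGbPerGib = 1073741824.0 / 1e9;

// Convert a rate or size from GiB to GB.
constexpr double gib_to_gb(double gib) {
    return gib * kGbPerGib;
}

// Convert a rate or size from GB to GiB.
constexpr double gb_to_gib(double gb) {
    return gb / kGbPerGib;
}
```

With these, the 2.6 GiB/s reported earlier in the thread is about 2.79 GB/s, and a 3.2 GB/s link limit is about 2.98 GiB/s.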

2.6 GiB/sec = 2.8 GB/sec, so this is already quite close to the practically achievable limit. In my experience, CUDA-Z underreports the achievable device/host transfer rates by about 3%. While the GPU can overlap DMA data transfers with CUDA kernel activity, concurrent OpenGL activity can interfere with the copy performance because OpenGL makes use of the PCIe link (at minimum, for sending commands down to the GPU) and also makes use of the GPU memory. Your experiment in this regard is appropriate and telling: with your two OpenGL apps open, CUDA-Z reports lower bandwidth.

My advice would be to use more capable hardware, in particular a faster link than PCIe gen 3 x4, and probably a faster GPU overall as well. I approach issues from an engineering perspective. That includes providing sufficient margin in a design for when things go less than perfectly.

matthewpottinger · October 17, 2020, 3:00am

Definitely upgrade the hardware ;) cuda-z is reporting 12.1 GiB/s on my RTX 2080ti. lol
