Thanks for all your help.
I realize I was not specific enough.
My end goal is to record OpenGL textures to disk at 60 Hz in 4K (3840 x 2160), RGBA (alpha is important in my case), 4 bytes per pixel. This is roughly 1.8 GiB/s of data.
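For reference, here is how I get to that figure. This is only a small Java sketch of the arithmetic; the class and method names are made up for illustration and are not part of my actual app.

```java
// Illustrative helper (hypothetical names): bandwidth needed for
// uncompressed RGBA capture at a given resolution and frame rate.
final class FrameBandwidth {

    // Size of one uncompressed frame in bytes.
    static long bytesPerFrame(int width, int height, int bytesPerPixel) {
        return (long) width * height * bytesPerPixel;
    }

    // Sustained data rate in bytes per second.
    static long bytesPerSecond(int width, int height, int bytesPerPixel, int fps) {
        return bytesPerFrame(width, height, bytesPerPixel) * fps;
    }

    // Convert bytes to GiB (1 GiB = 1024^3 bytes).
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long perFrame = bytesPerFrame(3840, 2160, 4);        // 33,177,600 bytes
        long perSecond = bytesPerSecond(3840, 2160, 4, 60);  // 1,990,656,000 bytes
        System.out.printf("%d bytes/frame, %.2f GiB/s%n", perFrame, toGiB(perSecond));
    }
}
```

So each frame is about 31.6 MiB, and 60 of them per second come to about 1.85 GiB/s (just under 2 GB/s in decimal units).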
This is for a pet project which does musical visualization.
I have to do this “live”: I cannot reduce the frame rate and adjust the recording afterwards.
So far I have not managed to copy from GPU to CPU memory at 1.8 GiB/s.
Here are my attempts:
I am about 300 MiB/s short of being able to record live at this resolution without resorting to complex workarounds, such as ring-buffering in GPU memory and/or slowing down the fps and then adjusting the recording.
The figure that I’m quoting (2.6 GiB/s) is from the cuda-z benchmark tool (http://cuda-z.sourceforge.net/), which uses the same method I described. Mentioning that figure without context was a bit silly, since I’m not actually achieving 2.6 GiB/s; otherwise I’d be happy.
I am achieving neither 2.6 GiB/s nor the 3.2 GiB/s mentioned by @njuffa, I’m guessing for 2 reasons:
- I have 2 OpenGL apps running at the same time. They don’t transfer anything from CPU to GPU, but they use 40% of the GPU’s processing power, so I’m guessing this impacts copy performance as well. Could someone confirm that? I can reproduce this behavior by opening my 2 OpenGL apps together with cuda-z: cuda-z then reports lower bandwidth.
- I may not be using the optimal way to copy from GPU to CPU yet. I am hoping that the cuda-z benchmark tool is not using the optimal way to copy data either, and that 3.2 GiB/s is achievable.
 A PCIe gen 3 x4 link maxes out at around 3.1 to 3.2 GB/sec in practice
This gives me hope I can achieve my end goal.
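To put a number on that hope, here is a rough per-frame budget check in Java. It is only a sketch with made-up names, and it assumes the quoted "GB/sec" means decimal gigabytes (10^9 bytes) and that the link could actually sustain that rate for my readbacks, which I have not measured.

```java
// Illustrative check (hypothetical names): does one frame's transfer
// fit inside the frame interval at a given sustained copy rate?
final class TransferBudget {

    // Milliseconds needed to move one frame at a sustained rate in bytes/second.
    static double transferMillis(long frameBytes, double bytesPerSecond) {
        return frameBytes / bytesPerSecond * 1000.0;
    }

    // Time available per frame at the given frame rate, in milliseconds.
    static double frameBudgetMillis(int fps) {
        return 1000.0 / fps;
    }

    public static void main(String[] args) {
        long frameBytes = 3840L * 2160L * 4L;   // one 4K RGBA frame
        double assumedRate = 3.1e9;             // assumption: 3.1 GB/s, decimal units
        double needed = transferMillis(frameBytes, assumedRate);
        double budget = frameBudgetMillis(60);
        System.out.printf("copy: %.2f ms, budget: %.2f ms%n", needed, budget);
    }
}
```

At that assumed rate a frame would take roughly 10.7 ms to copy against a budget of roughly 16.7 ms, so there would be some headroom, as long as nothing else competes for the link.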
 If host/device transfers are not using asynchronous copies using pinned system memory, there is additional overhead for a system memory copy from/to a pinned buffer maintained by the CUDA driver.
Are you implying that an async copy would be faster? Right now I am doing a synchronous copy with locked memory. If so, I should try that.
The fastest way to copy an OpenGL texture to CPU memory is almost certainly via an OpenGL API, not via CUDA/OpenGL interop.
So far, the fastest solution I have found uses CUDA, but maybe that is because I have not tried locked memory with the OpenGL APIs. I would like to try that, but with Java on Windows I could not find a way to allocate locked memory other than through the CUDA allocation methods.
Thanks for any help on this problem!