RTX4000 (Ada) Copy Engine / Bitblt throughput from CPU

I have been doing some testing on a full PCIe4 system with independent x16 lanes to the CPU, i.e. no bus switches.

Could anyone comment on a realistic expectation for how much raw RGB pixel data (uncompressed, so no NVDEC usage) can actually be sent via DirectX, in the best-case scenario, for rendering on the Windows desktop? Assume 4 x 4Kp60 monitors, extended desktop, Mosaic or not.

In our tests, feeding the GPU with 8 x 4Kp60 RGBA sources and scaling each of them to 1920x1080 appears to exceed the board’s capacity by a wide margin: the copy engine maxes out at 100%. That data rate is well below what PCIe4 itself can handle (and we have verified that we are nowhere near saturating the bus).
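For reference, here is a rough back-of-envelope sketch of the raw numbers involved. It assumes "4K" means 3840x2160, 4 bytes per pixel (8-bit RGBA) and 60 fps per source, and it compares against the theoretical PCIe 4.0 x16 line rate in one direction after 128b/130b encoding, ignoring packet and protocol overhead. The function names are just illustrative. Under those assumptions the 8 sources come to roughly 16 GB/s against about 31.5 GB/s theoretical.

```python
# Back-of-envelope estimate only; every figure here is an assumption,
# not a measurement from the test system.
WIDTH, HEIGHT = 3840, 2160   # assuming "4K" means UHD
BYTES_PER_PIXEL = 4          # 8-bit RGBA
FPS = 60
SOURCES = 8


def stream_bytes_per_sec(width, height, bytes_per_pixel, fps):
    """Raw uncompressed data rate of one video source in bytes/s."""
    return width * height * bytes_per_pixel * fps


def pcie_bytes_per_sec(gt_per_sec, lanes, enc_payload=128, enc_total=130):
    """Theoretical one-direction PCIe rate in bytes/s after line encoding.

    Ignores TLP/DLLP overhead, so real-world throughput will be lower.
    """
    return gt_per_sec * 1e9 * enc_payload / enc_total / 8 * lanes


per_stream = stream_bytes_per_sec(WIDTH, HEIGHT, BYTES_PER_PIXEL, FPS)
total = per_stream * SOURCES
pcie4_x16 = pcie_bytes_per_sec(16, 16)  # PCIe 4.0: 16 GT/s per lane, x16

print(f"per stream : {per_stream / 1e9:.2f} GB/s")
print(f"{SOURCES} streams  : {total / 1e9:.2f} GB/s")
print(f"PCIe4 x16  : {pcie4_x16 / 1e9:.2f} GB/s (theoretical, one direction)")
print(f"share of theoretical link rate: {total / pcie4_x16:.0%}")
```

This only covers the upload side; it says nothing about what the copy engine or the scaling pass can sustain, which is the part we are asking about.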

I am also interested in how much of a drag on overall performance DWM involvement adds to all of this, i.e. whether we would get much extra throughput from the more recent Win11 MPO gaming optimisation modes, which effectively put the app into full-screen mode and bypass DWM compositing when the window covers the whole monitor.

Thanks MT