We are facing an issue when using Vulkan on Windows in a shared environment (more than one application running), and our application is heavy on texture streaming. (Video transcoding/stitching with an asymmetric array of decoding and rendering GPUs, with full-duplex transfers requiring at least 50% of theoretical peak throughput, and hard real-time requirements.)
The problem we are facing is how the DMA (copy) engines are mapped to queues in Vulkan.
When we use CUDA, transfers are properly sorted onto different engines by transfer direction, achieving full-duplex operation on the constraining PCIe bus.
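For comparison, this is roughly how we drive CUDA today: one stream per transfer direction, which lets the driver route each direction onto its own copy engine so uploads and downloads overlap on the bus. (This is a minimal sketch, not our production code; the buffer names and sizes are placeholders, and it assumes pinned host memory allocated with cudaMallocHost so that the copies are actually asynchronous.)

```cpp
#include <cuda_runtime.h>

void fullDuplexSketch(void* dDst, const void* hSrc,
                      void* hDst, const void* dSrc, size_t bytes) {
    // Two streams, one per transfer direction. On GPUs with at least two
    // copy engines, the CUDA driver schedules host-to-device and
    // device-to-host work onto different engines, so both directions
    // proceed concurrently (full duplex on PCIe).
    cudaStream_t upStream, downStream;
    cudaStreamCreate(&upStream);
    cudaStreamCreate(&downStream);

    // Host-to-device uploads on one stream...
    cudaMemcpyAsync(dDst, hSrc, bytes, cudaMemcpyHostToDevice, upStream);
    // ...device-to-host downloads on the other, overlapping on the bus.
    cudaMemcpyAsync(hDst, dSrc, bytes, cudaMemcpyDeviceToHost, downStream);

    cudaStreamSynchronize(upStream);
    cudaStreamSynchronize(downStream);
    cudaStreamDestroy(upStream);
    cudaStreamDestroy(downStream);
}
```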
For the transfer queues exposed via the Vulkan API, this doesn't work at all. At the time the transfer queue is allocated, it is hard-mapped to the first DMA engine only. This is devastating for performance: not only is the application limited to half-duplex transfers, it also conflicts with the DMA throughput achievable with CUDA on the same system, because CUDA had reserved that DMA engine for device-to-host transfers only. Effectively, using the Vulkan API crashes transfer throughput system-wide.
We are hitting the worst-case scenario, where texture uploads via Vulkan run in parallel with buffer uploads via CUDA on a different DMA engine, effectively quadrupling latency on the first DMA engine: first both engines stall, and then no downstream transfers can be performed at all while the first engine is stalled.
We require a way to specifically target the DMA engine most suitable for a given transfer direction. While we can efficiently batch/queue transfers by direction on the application side, we must either be able to distinguish which transfer-only queue maps to which DMA engine, or the application-side half of the driver must dispatch each command buffer to the most appropriate DMA engine (rather than queuing everything onto the same DMA engine without consideration).