I’ve been investigating a GPU processing bubble in a Vulkan application.
For a while I’ve been seeing added latency and long synchronization waits where Vulkan interacts with the presentation engine.
This happens on both Windows and Linux (Wayland).
The workflow is as follows:
Standard two frames in flight: while the GPU processes frame N, the CPU records frame N + 1.
The CPU frame fence only blocks until frame N - 1 has signaled, so that its resources can be reused.
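To make the indexing concrete, here is a minimal sketch of that scheme as plain C++ with no Vulkan calls. The names `kFramesInFlight`, `slot_for_frame` and `frame_to_wait_for` are hypothetical and only stand in for whatever per-frame resources and fences the real renderer keeps.

```cpp
#include <cstdint>

// Hypothetical model, not Vulkan code: a "slot" stands in for the per-frame
// resources (command buffers, fence) that get reused every other frame.
constexpr std::uint64_t kFramesInFlight = 2;

// Slot whose resources frame `frame` records into.
constexpr std::uint64_t slot_for_frame(std::uint64_t frame) {
    return frame % kFramesInFlight;
}

// Frame whose fence the CPU must see signaled before it may record `frame`,
// or -1 if the slot has not been used yet.
constexpr std::int64_t frame_to_wait_for(std::uint64_t frame) {
    return frame < kFramesInFlight
               ? -1
               : static_cast<std::int64_t>(frame - kFramesInFlight);
}
```

With two slots, recording frame N + 1 only needs frame N - 1 to have finished, which matches the fence behaviour described above: `frame_to_wait_for(3)` yields 1, and the first two frames wait on nothing.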
Frame Job (N):
- vkAcquireNextImageKHR - signals acquire_semaphore.
- vkQueueSubmit - signals submit_semaphore.
- submit[0] - graphics job - doesn’t wait on a semaphore; synchronized internally with barriers.
- submit[1] - swapchain blit - waits on acquire_semaphore.
- vkQueuePresentKHR - waits on submit_semaphore.
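The semaphore chain in the list above can be written down as a small dependency model. This is a rough sketch in plain C++, not Vulkan code: `Op`, `frame_job` and `depends_on` are hypothetical names, and it assumes submit_semaphore is signaled by submit[1] (the list only says the vkQueueSubmit call signals it). It only models semaphores, so barrier and submission-order dependencies between submit[0] and submit[1] are not represented.

```cpp
#include <string>
#include <vector>

// Hypothetical model of one frame job; only semaphore edges are modeled.
struct Op {
    std::string name;
    std::vector<std::string> waits;    // semaphores this op waits on
    std::vector<std::string> signals;  // semaphores this op signals
};

std::vector<Op> frame_job() {
    return std::vector<Op>{
        Op{"acquire", {}, {"acquire_semaphore"}},
        Op{"submit[0]", {}, {}},
        // Assumption: the second submit is the one that signals submit_semaphore.
        Op{"submit[1]", {"acquire_semaphore"}, {"submit_semaphore"}},
        Op{"present", {"submit_semaphore"}, {}},
    };
}

// True if `op_name` waits on `semaphore`, directly or through a chain of
// semaphores signaled by other ops. Assumes the graph has no cycles.
bool depends_on(const std::vector<Op>& ops, const std::string& op_name,
                const std::string& semaphore) {
    for (const Op& op : ops) {
        if (op.name != op_name) continue;
        for (const std::string& waited : op.waits) {
            if (waited == semaphore) return true;
            // Follow the chain through whichever op signals `waited`.
            for (const Op& producer : ops) {
                for (const std::string& signaled : producer.signals) {
                    if (signaled == waited &&
                        depends_on(ops, producer.name, semaphore)) {
                        return true;
                    }
                }
            }
        }
    }
    return false;
}
```

In this model submit[1] and the present are both gated by acquire_semaphore, while submit[0] is not, so as far as semaphores go the graphics job is free to start before the image has been acquired.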
Presentation uses immediate mode (mailbox mode exhibits the same pattern).
Based on the GPU trace, I can see two main issues.
The first and most obvious issue is the long synchronization wait in submit[1] while it waits for acquire_semaphore to signal, which indicates that the previous presentation of this image has concluded.
The second potential issue is what appears to be a lack of parallel execution between the graphics job and the presentation. I have moved the presentation to a different queue, both in the same family and in other queue families, yet I’m never able to overlap the two. I wonder if it’s even possible to do so.
There’s also a large idle period before the “unattributed” context (which I assume is the swapchain compositor). This might be down to synchronization requirements imposed by the OS, in which case I doubt I can do anything about it; please tell me if I’m wrong here.
Since a lot of time is spent simply waiting for swapchain image acquisition, I figured that increasing the number of swapchain images would reduce the time spent waiting on synchronization. Counterintuitively, it didn’t: even with a large number of swapchain images, the wait on acquire_semaphore stays the same, which I can’t explain.
I would appreciate any pointers or suggestions. Let me know if further information is required. I would share the code, but it’s heavily abstracted and not easy to parse, so I think it would be counterproductive.
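For context on the image count itself, a minimal sketch of how a requested count is bounded by the surface limits. `ImageCountLimits` and `clamp_image_count` are hypothetical names; the two fields mirror `minImageCount` and `maxImageCount` from `VkSurfaceCapabilitiesKHR`, where a maximum of 0 means there is no upper limit. The value passed at swapchain creation is only a minimum, so the count actually created is the one reported by `vkGetSwapchainImagesKHR`.

```cpp
#include <algorithm>
#include <cstdint>

// Mirrors the two VkSurfaceCapabilitiesKHR fields that bound the image count.
struct ImageCountLimits {
    std::uint32_t min_image_count;
    std::uint32_t max_image_count;  // 0 means "no upper limit"
};

// Clamp a requested swapchain image count into the range the surface allows.
std::uint32_t clamp_image_count(std::uint32_t requested, ImageCountLimits limits) {
    std::uint32_t count = std::max(requested, limits.min_image_count);
    if (limits.max_image_count != 0) {
        count = std::min(count, limits.max_image_count);
    }
    return count;
}
```

For example, requesting 8 images on a surface that reports a maximum of 3 gives 3, so a large requested number does not necessarily translate into a large swapchain.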
Thank you.
