GPU Idle synchronization & parallel presentation - Vulkan Application

I’ve been investigating a GPU processing bubble in a Vulkan application.
For a while I’ve been experiencing latency and extended synchronization waits in the interaction between Vulkan and the presentation engine.
This happens both on Windows and on Linux under Wayland.


The workflow is as follows:

Standard two frames in flight: while the GPU processes frame N, the CPU records frame N + 1.
The CPU frame fence only blocks until frame N - 1 is signaled, so resources can be reused.

Frame Job (N):

  • vkAcquireNextImage - signals acquire_semaphore.
  • vkQueueSubmit - signals submit_semaphore.
    • submit[0] - graphics job - waits on no semaphore; relies on internal barrier synchronization.
    • submit[1] - swapchain blit - waits on acquire_semaphore.
  • vkQueuePresent - waits on submit_semaphore.

Immediate present mode (mailbox mode exhibits the same pattern).
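
Roughly, the per-frame structure looks like this (a minimal sketch rather than my actual code; all handles and command buffers are assumed to be created elsewhere, the names are illustrative, and error handling is omitted):

```cpp
#include <vulkan/vulkan.h>

// One frame of the loop described above. All handles are created elsewhere.
void submit_frame(VkDevice device, VkQueue queue, VkSwapchainKHR swapchain,
                  VkSemaphore acquire_semaphore, VkSemaphore submit_semaphore,
                  VkFence frame_fence,
                  VkCommandBuffer graphics_cmd, VkCommandBuffer blit_cmd)
{
    // Acquire the next swapchain image; acquire_semaphore is signaled once
    // the presentation engine is done with that image.
    uint32_t image_index = 0;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          acquire_semaphore, VK_NULL_HANDLE, &image_index);

    // submit[0]: graphics job, no semaphore wait (internal barriers only).
    VkSubmitInfo submits[2] = {};
    submits[0].sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submits[0].commandBufferCount = 1;
    submits[0].pCommandBuffers = &graphics_cmd;

    // submit[1]: swapchain blit, waits on acquire_semaphore and signals
    // submit_semaphore for the present.
    VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;
    submits[1].sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submits[1].waitSemaphoreCount = 1;
    submits[1].pWaitSemaphores = &acquire_semaphore;
    submits[1].pWaitDstStageMask = &wait_stage;
    submits[1].commandBufferCount = 1;
    submits[1].pCommandBuffers = &blit_cmd;
    submits[1].signalSemaphoreCount = 1;
    submits[1].pSignalSemaphores = &submit_semaphore;

    vkQueueSubmit(queue, 2, submits, frame_fence);

    // Present waits on submit_semaphore.
    VkPresentInfoKHR present_info = {};
    present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    present_info.waitSemaphoreCount = 1;
    present_info.pWaitSemaphores = &submit_semaphore;
    present_info.swapchainCount = 1;
    present_info.pSwapchains = &swapchain;
    present_info.pImageIndices = &image_index;
    vkQueuePresentKHR(queue, &present_info);
}
```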


Based on the GPU Trace, I can see two main issues.

The first, and most obvious, issue is the long synchronization wait before submit[1] executes: it waits for the signal from acquire_semaphore, which indicates that the last presentation of this image has concluded.

The second potential issue is what appears to be a lack of parallel execution between the graphics job and the presentation. I have moved the presentation to a different queue, both within the same family and in other queue families, yet I’m never able to overlap the two. I wonder if it’s even possible to do so.
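
For reference, the cross-queue variant looks roughly like this (a simplified sketch rather than my exact code; graphics_done_semaphore is an illustrative name for the semaphore needed to order the two submits across queues, since barriers don’t synchronize between queues, and any queue-family ownership transfer for the blitted image is left out):

```cpp
#include <vulkan/vulkan.h>

// Graphics job on one queue, blit + present on a second queue.
void submit_frame_two_queues(VkQueue graphics_queue, VkQueue present_queue,
                             VkSwapchainKHR swapchain, uint32_t image_index,
                             VkSemaphore acquire_semaphore,
                             VkSemaphore graphics_done_semaphore,
                             VkSemaphore submit_semaphore,
                             VkFence frame_fence,
                             VkCommandBuffer graphics_cmd, VkCommandBuffer blit_cmd)
{
    // Graphics job: signals graphics_done_semaphore for the other queue.
    VkSubmitInfo graphics_submit = {};
    graphics_submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    graphics_submit.commandBufferCount = 1;
    graphics_submit.pCommandBuffers = &graphics_cmd;
    graphics_submit.signalSemaphoreCount = 1;
    graphics_submit.pSignalSemaphores = &graphics_done_semaphore;
    vkQueueSubmit(graphics_queue, 1, &graphics_submit, VK_NULL_HANDLE);

    // Blit on the second queue: waits on the graphics job and on the
    // swapchain image becoming available, then signals for the present.
    VkSemaphore wait_semaphores[2] = { graphics_done_semaphore, acquire_semaphore };
    VkPipelineStageFlags wait_stages[2] = { VK_PIPELINE_STAGE_TRANSFER_BIT,
                                            VK_PIPELINE_STAGE_TRANSFER_BIT };
    VkSubmitInfo blit_submit = {};
    blit_submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    blit_submit.waitSemaphoreCount = 2;
    blit_submit.pWaitSemaphores = wait_semaphores;
    blit_submit.pWaitDstStageMask = wait_stages;
    blit_submit.commandBufferCount = 1;
    blit_submit.pCommandBuffers = &blit_cmd;
    blit_submit.signalSemaphoreCount = 1;
    blit_submit.pSignalSemaphores = &submit_semaphore;
    vkQueueSubmit(present_queue, 1, &blit_submit, frame_fence);

    VkPresentInfoKHR present_info = {};
    present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    present_info.waitSemaphoreCount = 1;
    present_info.pWaitSemaphores = &submit_semaphore;
    present_info.swapchainCount = 1;
    present_info.pSwapchains = &swapchain;
    present_info.pImageIndices = &image_index;
    vkQueuePresentKHR(present_queue, &present_info);
}
```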

There’s also a large idle time until the “unattributed” context runs (which I imagine is the swapchain compositor). However, this might be related to synchronization requirements imposed by the OS, and I doubt I would be able to address that; please tell me if I’m wrong here.


Since a lot of time is wasted simply waiting for swapchain image acquisition, I figured that increasing the number of swapchain images could reduce the time spent waiting on synchronization. Unintuitively, it didn’t. Even with a large number of swapchain images, the acquire_semaphore wait time stays the same, which I can’t explain.
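
For context, the image count is requested the usual way and clamped to the surface capabilities (a simplified sketch, not my exact code; choose_image_count and desired_images are just illustrative names):

```cpp
#include <vulkan/vulkan.h>
#include <algorithm>

// Request `desired_images` swapchain images, clamped to what the surface supports.
uint32_t choose_image_count(VkPhysicalDevice physical_device,
                            VkSurfaceKHR surface, uint32_t desired_images)
{
    VkSurfaceCapabilitiesKHR caps = {};
    vkGetPhysicalDeviceSurfaceCapabilitiesKHR(physical_device, surface, &caps);

    uint32_t count = std::max(desired_images, caps.minImageCount);
    // maxImageCount == 0 means there is no upper limit.
    if (caps.maxImageCount != 0)
        count = std::min(count, caps.maxImageCount);
    return count;
}

// Later, when creating the swapchain:
// VkSwapchainCreateInfoKHR create_info = {};
// create_info.sType         = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR;
// create_info.minImageCount = choose_image_count(physical_device, surface, 5);
// create_info.presentMode   = VK_PRESENT_MODE_IMMEDIATE_KHR;  // mailbox behaves the same
// ...
```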

I would appreciate any pointers or suggestions. Let me know if further information is required. I would share the code, but it’s heavily abstracted and not easy to parse, so I feel that would be counterproductive.

Thank you.

Just thought I should add:

There aren’t any validation errors, no artifacts, and no apparent synchronization problems.

The image index is different for every frame, so it’s not selecting the same image for multiple consecutive frames, which would have explained the long synchronization wait, since the image could still be in use by the presentation engine.

Every semaphore is unsignaled when used, and it is only reused once a fence guarantees that its previous usage has completed.
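
Concretely, the per-frame synchronization objects are organized along these lines (an illustrative sketch of the structure rather than my exact code; FrameSync and begin_frame are made-up names):

```cpp
#include <vulkan/vulkan.h>

static const uint32_t FRAMES_IN_FLIGHT = 2;

// Semaphores and the fence are indexed by frame-in-flight slot.
struct FrameSync {
    VkSemaphore acquire_semaphore;
    VkSemaphore submit_semaphore;
    VkFence     frame_fence;  // signaled by the vkQueueSubmit of this slot;
                              // assumed created with VK_FENCE_CREATE_SIGNALED_BIT
                              // so the first frames don't block.
};

void begin_frame(VkDevice device, FrameSync frames[FRAMES_IN_FLIGHT],
                 uint64_t frame_number)
{
    FrameSync& frame = frames[frame_number % FRAMES_IN_FLIGHT];

    // Block on the frame that last used this slot (frame N - 1 when the CPU
    // is about to record frame N + 1); the GPU may still be processing frame N.
    vkWaitForFences(device, 1, &frame.frame_fence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &frame.frame_fence);

    // This fence wait is what provides the guarantee described above before
    // the slot's semaphores are reused for acquire / submit / present.
}
```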

I believe I’ve made some progress in understanding the cause of the long swapchain acquisition waits, and also why increasing the number of images doesn’t eliminate or reduce the time spent waiting for synchronization.

It has to do with Vulkan and the presentation engine using different contexts, and with when and why these context switches happen.


Because the CPU can record frames faster than the GPU can consume them, once the GPU finishes processing frame N, frame N + 1 is already queued. If the image acquisition doesn’t force a stall, the next frame can begin immediately.

The driver chooses to remain in the current context processing frames instead of forcing a context switch to the presentation engine. This continues for as many frames as there are swapchain images.

Once the swapchain images are exhausted, vkAcquireNextImage will attempt to hand back the first image used; this stalls because that image is still pending presentation. That is the first point where the context switch to the presentation engine happens and the latency appears. After that, every frame that follows triggers the same stall on the Vulkan side and the same context switch to the presentation engine.
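
One way to double-check where the stall lands is to time the acquire itself: with a zero timeout, vkAcquireNextImageKHR returns VK_NOT_READY instead of blocking when no image is available, which separates a CPU-side stall in the acquire from the GPU-side wait on acquire_semaphore seen in the trace. A rough sketch (timed_acquire is just an illustrative helper, not something from my codebase):

```cpp
#include <vulkan/vulkan.h>
#include <chrono>
#include <cstdio>

// Time a non-blocking acquire. On VK_NOT_READY the semaphore is not signaled,
// so the caller has to retry before submitting work that waits on it.
VkResult timed_acquire(VkDevice device, VkSwapchainKHR swapchain,
                       VkSemaphore acquire_semaphore, uint32_t* image_index)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    VkResult result = vkAcquireNextImageKHR(device, swapchain, /*timeout=*/0,
                                            acquire_semaphore, VK_NULL_HANDLE,
                                            image_index);
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("acquire: %.3f ms, result = %d\n", ms, (int)result);
    return result;  // VK_NOT_READY means no image was available yet
}
```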


However, I also came to the realization that even if the driver chose to context switch after every vkQueuePresent, the overall latency would still be the same (or at least very similar), because the time spent waiting for synchronization is precisely the time during which the presentation context is running. So, to a certain degree, I have answered my own question.

The only question that remains is: can we avoid the context switch to the presentation engine, or have it occur in parallel with Vulkan execution?

Hi, can you try posting the same question over on the Vulkan forum? I’ve been looking for someone who might be able to answer this but haven’t had luck so far, and I think you may do better with participants who are focused on Vulkan itself.