GPU Idle synchronization & parallel presentation - Vulkan Application

I’ve been investigating a GPU processing bubble in a Vulkan application.
For a while I’ve been experiencing latency and extended synchronization waits in the interaction between Vulkan and the presentation engine.
This happens both on Windows and on Linux under Wayland.


The workflow is as follows:

Standard two frames in flight: while the GPU processes frame N, the CPU records frame N + 1.
The CPU frame fence only blocks until frame N - 1 is signaled, so resources can be reused.

Frame Job (N):

  • vkAcquireNextImage - signals acquire_semaphore.
  • vkQueueSubmit - signals submit_semaphore.
    • submit[0] - graphics job - waits on no semaphore; relies on internal barrier synchronization.
    • submit[1] - swapchain blit - waits on acquire_semaphore.
  • vkQueuePresent - waits on submit_semaphore.

Immediate present mode (mailbox mode exhibits the same pattern).
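
Roughly, the per-frame structure looks like this (a minimal sketch rather than my actual code; all handles and command buffers are assumed to be created elsewhere, the names are illustrative, and error handling is omitted):

```cpp
#include <vulkan/vulkan.h>

// One frame of the loop described above. All handles are created elsewhere.
void submit_frame(VkDevice device, VkQueue queue, VkSwapchainKHR swapchain,
                  VkSemaphore acquire_semaphore, VkSemaphore submit_semaphore,
                  VkFence frame_fence,
                  VkCommandBuffer graphics_cmd, VkCommandBuffer blit_cmd)
{
    // Acquire the next swapchain image; acquire_semaphore is signaled once
    // the presentation engine is done with that image.
    uint32_t image_index = 0;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          acquire_semaphore, VK_NULL_HANDLE, &image_index);

    // submit[0]: graphics job, no semaphore wait (internal barriers only).
    VkSubmitInfo submits[2] = {};
    submits[0].sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submits[0].commandBufferCount = 1;
    submits[0].pCommandBuffers = &graphics_cmd;

    // submit[1]: swapchain blit, waits on acquire_semaphore and signals
    // submit_semaphore for the present.
    VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_TRANSFER_BIT;
    submits[1].sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submits[1].waitSemaphoreCount = 1;
    submits[1].pWaitSemaphores = &acquire_semaphore;
    submits[1].pWaitDstStageMask = &wait_stage;
    submits[1].commandBufferCount = 1;
    submits[1].pCommandBuffers = &blit_cmd;
    submits[1].signalSemaphoreCount = 1;
    submits[1].pSignalSemaphores = &submit_semaphore;

    vkQueueSubmit(queue, 2, submits, frame_fence);

    // Present waits on submit_semaphore.
    VkPresentInfoKHR present_info = {};
    present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    present_info.waitSemaphoreCount = 1;
    present_info.pWaitSemaphores = &submit_semaphore;
    present_info.swapchainCount = 1;
    present_info.pSwapchains = &swapchain;
    present_info.pImageIndices = &image_index;
    vkQueuePresentKHR(queue, &present_info);
}
```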


Based on the GPU Trace, I can see two main issues.

The first, and most obvious, issue is the long synchronization wait before submit[1] executes: it waits for the signal from acquire_semaphore, which indicates that the last presentation of this image has concluded.

The second potential issue is what appears to be a lack of parallel execution between the graphics job and the presentation. I have moved the presentation to a different queue, both within the same family and in other queue families, yet I’m never able to overlap the two. I wonder if it’s even possible to do so.
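
For reference, the cross-queue variant looks roughly like this (a simplified sketch rather than my exact code; graphics_done_semaphore is an illustrative name for the semaphore needed to order the two submits across queues, since barriers don’t synchronize between queues, and any queue-family ownership transfer for the blitted image is left out):

```cpp
#include <vulkan/vulkan.h>

// Graphics job on one queue, blit + present on a second queue.
void submit_frame_two_queues(VkQueue graphics_queue, VkQueue present_queue,
                             VkSwapchainKHR swapchain, uint32_t image_index,
                             VkSemaphore acquire_semaphore,
                             VkSemaphore graphics_done_semaphore,
                             VkSemaphore submit_semaphore,
                             VkFence frame_fence,
                             VkCommandBuffer graphics_cmd, VkCommandBuffer blit_cmd)
{
    // Graphics job: signals graphics_done_semaphore for the other queue.
    VkSubmitInfo graphics_submit = {};
    graphics_submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    graphics_submit.commandBufferCount = 1;
    graphics_submit.pCommandBuffers = &graphics_cmd;
    graphics_submit.signalSemaphoreCount = 1;
    graphics_submit.pSignalSemaphores = &graphics_done_semaphore;
    vkQueueSubmit(graphics_queue, 1, &graphics_submit, VK_NULL_HANDLE);

    // Blit on the second queue: waits on the graphics job and on the
    // swapchain image becoming available, then signals for the present.
    VkSemaphore wait_semaphores[2] = { graphics_done_semaphore, acquire_semaphore };
    VkPipelineStageFlags wait_stages[2] = { VK_PIPELINE_STAGE_TRANSFER_BIT,
                                            VK_PIPELINE_STAGE_TRANSFER_BIT };
    VkSubmitInfo blit_submit = {};
    blit_submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    blit_submit.waitSemaphoreCount = 2;
    blit_submit.pWaitSemaphores = wait_semaphores;
    blit_submit.pWaitDstStageMask = wait_stages;
    blit_submit.commandBufferCount = 1;
    blit_submit.pCommandBuffers = &blit_cmd;
    blit_submit.signalSemaphoreCount = 1;
    blit_submit.pSignalSemaphores = &submit_semaphore;
    vkQueueSubmit(present_queue, 1, &blit_submit, frame_fence);

    VkPresentInfoKHR present_info = {};
    present_info.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
    present_info.waitSemaphoreCount = 1;
    present_info.pWaitSemaphores = &submit_semaphore;
    present_info.swapchainCount = 1;
    present_info.pSwapchains = &swapchain;
    present_info.pImageIndices = &image_index;
    vkQueuePresentKHR(present_queue, &present_info);
}
```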

There’s also a large idle time until the “unattributed” context runs (which I imagine is the swapchain compositor). However, this might be related to synchronization requirements imposed by the OS, and I doubt I would be able to address that; please tell me if I’m wrong here.


Since a lot of time is wasted simply waiting for swapchain image acquisition, I figured that increasing the number of swapchain images could reduce the time spent waiting on synchronization. Unintuitively, it didn’t. Even with a large number of swapchain images, the acquire_semaphore wait time stays the same, which I can’t explain.
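
For context, the image count is requested the usual way and clamped to the surface capabilities (a simplified sketch, not my exact code; choose_image_count and desired_images are just illustrative names):

```cpp
#include <vulkan/vulkan.h>
#include <algorithm>

// Request `desired_images` swapchain images, clamped to what the surface supports.
uint32_t choose_image_count(VkPhysicalDevice physical_device,
                            VkSurfaceKHR surface, uint32_t desired_images)
{
    VkSurfaceCapabilitiesKHR caps = {};
    vkGetPhysicalDeviceSurfaceCapabilitiesKHR(physical_device, surface, &caps);

    uint32_t count = std::max(desired_images, caps.minImageCount);
    // maxImageCount == 0 means there is no upper limit.
    if (caps.maxImageCount != 0)
        count = std::min(count, caps.maxImageCount);
    return count;
}

// Later, when creating the swapchain:
// VkSwapchainCreateInfoKHR create_info = {};
// create_info.sType         = VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR;
// create_info.minImageCount = choose_image_count(physical_device, surface, 5);
// create_info.presentMode   = VK_PRESENT_MODE_IMMEDIATE_KHR;  // mailbox behaves the same
// ...
```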

I would appreciate any pointers or suggestions. Let me know if further information is required. I would share the code, but it’s heavily abstracted and not easy to parse, so I feel that would be counterproductive.

Thank you.

Just thought I should add:

There aren’t any validation errors, no artifacts, and no apparent synchronization problems.

The image index is different for every frame, so it’s not selecting the same image for multiple consecutive frames, which would have explained the long synchronization wait, since the image could still be in use by the presentation engine.

Every semaphore is unsignaled when used, and it is only reused once a fence guarantees that its previous usage has completed.
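
Concretely, the per-frame synchronization objects are organized along these lines (an illustrative sketch of the structure rather than my exact code; FrameSync and begin_frame are made-up names):

```cpp
#include <vulkan/vulkan.h>

static const uint32_t FRAMES_IN_FLIGHT = 2;

// Semaphores and the fence are indexed by frame-in-flight slot.
struct FrameSync {
    VkSemaphore acquire_semaphore;
    VkSemaphore submit_semaphore;
    VkFence     frame_fence;  // signaled by the vkQueueSubmit of this slot;
                              // assumed created with VK_FENCE_CREATE_SIGNALED_BIT
                              // so the first frames don't block.
};

void begin_frame(VkDevice device, FrameSync frames[FRAMES_IN_FLIGHT],
                 uint64_t frame_number)
{
    FrameSync& frame = frames[frame_number % FRAMES_IN_FLIGHT];

    // Block on the frame that last used this slot (frame N - 1 when the CPU
    // is about to record frame N + 1); the GPU may still be processing frame N.
    vkWaitForFences(device, 1, &frame.frame_fence, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &frame.frame_fence);

    // This fence wait is what provides the guarantee described above before
    // the slot's semaphores are reused for acquire / submit / present.
}
```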

I believe I’ve made some progress in understanding the cause of the long swapchain acquisition waits, and also why increasing the number of images doesn’t eliminate or reduce the time spent waiting for synchronization.

It has to do with Vulkan and the presentation engine using different contexts, and with when and why these context switches happen.


Because the CPU can record frames faster than the GPU can consume them, once the GPU finishes processing frame N, frame N + 1 is already queued. If the image acquisition doesn’t force a stall, the next frame can begin immediately.

The driver chooses to remain in the current context processing frames instead of forcing a context switch to the presentation engine. This continues for as many frames as there are swapchain images.

Once the swapchain images are exhausted, vkAcquireNextImage will attempt to hand back the first image used; this stalls because that image is still pending presentation. That is the first point where the context switch to the presentation engine happens and the latency appears. After that, every frame that follows triggers the same stall on the Vulkan side and the same context switch to the presentation engine.
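
One way to double-check where the stall lands is to time the acquire itself: with a zero timeout, vkAcquireNextImageKHR returns VK_NOT_READY instead of blocking when no image is available, which separates a CPU-side stall in the acquire from the GPU-side wait on acquire_semaphore seen in the trace. A rough sketch (timed_acquire is just an illustrative helper, not something from my codebase):

```cpp
#include <vulkan/vulkan.h>
#include <chrono>
#include <cstdio>

// Time a non-blocking acquire. On VK_NOT_READY the semaphore is not signaled,
// so the caller has to retry before submitting work that waits on it.
VkResult timed_acquire(VkDevice device, VkSwapchainKHR swapchain,
                       VkSemaphore acquire_semaphore, uint32_t* image_index)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    VkResult result = vkAcquireNextImageKHR(device, swapchain, /*timeout=*/0,
                                            acquire_semaphore, VK_NULL_HANDLE,
                                            image_index);
    auto t1 = std::chrono::high_resolution_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("acquire: %.3f ms, result = %d\n", ms, (int)result);
    return result;  // VK_NOT_READY means no image was available yet
}
```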


However, I also came to the realization that even if the driver chose to context switch after every vkQueuePresent, the overall latency would still be the same (or at least very similar), because the time spent waiting for synchronization is precisely the time during which the presentation context is running. So, to a certain degree, I have answered my own question.

The only question that remains is: can we avoid the context switch to the presentation engine, or have it occur in parallel with Vulkan execution?

Hi, can you try posting the same question over on the Vulkan forum? I’ve been looking for someone who might be able to answer this but haven’t had luck so far, and I think you may do better with participants who are focused on Vulkan itself.