Why is my application waiting for a semaphore every 5 frames or so?

I wrote a test application that simply draws a lot of triangles in order to make the GPU spend time in vkCmdDraw. I am even only using a single command buffer and fence (no double buffering).

All I want is to understand what is happening.

The application shows that (with certain settings) every so many frames there is a “slow” frame: while normally the time between a submit of the command buffer and the begin of execution on the GPU is microseconds (say, 73 microseconds); every couple of frames there are roughly 14 ms or so in between the submit and the GPU to actually start doing something?!

I determined this using Tracy - but that didn’t tell me anything about the why.

Then someone suggested Nsight. Capturing a single frame doesn’t help at all (that doesn’t even tell me how long the frame was); the only way I could see this stall is by using “GPU Trace Profiler” - capturing several frames and then hoping that the weird skip is present in that capture.

I managed to capture it this way, which you can see here:

The only information that this adds (or even gives) is that the GPU
is waiting for a semaphore: “Wait: Semaphore=4 Value=16777215” (which is 0xffffff by the way).

How can I find out which semaphore that is? And why isn’t it signaled for ~12.3 ms every now and then?

That’s the value GPU Trace uses when it doesn’t find a corresponding signal in the user signals. This is typically coming from system-generated signals, like the one used for the vkAcquireNextImageKHR.

Considering your times are just a bit higher than 16ms, I would look towards how your swapchain and vsync interact.

Thank you for you help! I tried to run my application in Nsight-systems instead, and at least that is showing the VSYNC - which is just a little more information.

Here is a screen capture of a part of Nsight-systems:

This shows that the GPU activity (a single vkCmdDraw, that always uses 14.3 to 14.4 ms) begins more or less at a VSYNC, as opposed to immediately after the submit.

For clarity: I am not looking for a bug. I am assuming that this is normal behavior.
But, could you explain to me why the CPU would wait for a VSYNC as opposed to immediately starting to execute the submitted command buffer? I just can’t think of any reason :/.

Here is a capture made with Tracy:

This shows that (with these settings) every other frame is ~18.3 ms and ~31.7 ms.

Since, apparently, it takes more than 16.67 ms for one frame, I can’t get a frame rate of 60; and it has to sync with the VSYNCs somehow - so getting alternating CPU-frame-lengths is to be expected [The skip causes each pair of frames to be 50 ms long - which is 1/3 of the frame rate of my monitor(s) and results in an average frame rate of 25 ms, or 40 FPS. If each frame duration was of equal length - then the frame rate would have to drop to 30 FPS. Having alternating short/long CPU-frame times seems best.]

So what is my problem? Well, I don’t understand why the GPU stalls after the submit until the next VSYNC instead of simply stalling in vkQueuePresentKHR. There must be some resource not available at the moment of the submit, otherwise it would never stop. Which resource is that?

The resource that’s presumably not available is the swapchain buffer you’re wanting to write to. The OS uses swapchain buffers you’ve filled out when doing desktop composition or scanout to display, depending on exactly which present code your application is getting.
If it’s actively using the buffer you’re wanting to write to, It will not signal the corresponding semaphore until it’s done with it (which will typically happen around a vsync event). How many buffers do you have in your swapchain?

1 Like

I am using two swapchain images. What I did was port API without Secrets: The Practical Approach to Vulkan* - Part 1 to my own engine (that’s a near 100% rewrite; but the app still looks the same, with the sliders and so forth). I am not even sure if they controlled the number of swapchain images at all (ie, together with the number of "frame resources’), but in my case those two are separated. I am currently adding a slider that will allow me to change the number of swapchain images away from the (now) fixed 2.

PS In case anyone is interested in that code (it’s open source, of course); the essence of the application is now in this file: linuxviewer/Window.h at master · CarloWood/linuxviewer · GitHub

As you already realized from the Nsight Systems timeline, the long (red) frames end on the VSYNC timing, which means the swap chain is being exhausted before the GPU can render all the commands that were sent to it in previous frames.

One way to change this is to make the value of minImageCount inside the VkSwapchainCreateInfoKHR parameter to vkCreateSwapchainKHR. For example, if you changed that value to 4 (and your system successfully allocates that many swap chain buffers), you would see the long (red) frame appear every 4 frames instead of every 2.

If this were a real application you were trying to improve the performance for, you would probably want to increase the amount of GPU work being performed, to use offscreen rendering, or simply to disable VSYNC.

Thank you for insights. I am not sure I can follow though :(. Why would it wait 4 frames, instead of just 1 or 2? Is this to “reset” the swap chain and have the maximum amount of free images again? Why not have a shorter hickup, more often. It seems more logical to just skip one VSYNC, and doing that four times more often, then to skip four VSYNCs every time, but less often, no?

I’ve had a quick look at your code. It looks like you’re using a single semaphore for all acquires (I’m not entirely sure, I only browsed). If that’s the case, I don’t see how that can work properly: you really want a semaphore for each swapchain buffer, to know when they’re available.

It’s a trick; I use N+1 semaphores - where N is the number of swapchain images. This is necessary because you already have to pass a semaphore before knowing what resource you’ll get (when calling acquire_next_image). What is done is to use the semaphore of the resource that was returned the previous frame (so you know it is unused in advance). The swap occurs in Swapchain::update_current_index.

Thus,

// m_acquire_semaphore is known to be unused.
uint32_t new_swapchain_index;
acquireNextImageKHR(vh_swapchain, timeout, m_acquire_semaphore, vh_fence, &new_swapchain_index);
m_resources[new_swapchain_index].swap_image_available_semaphore_with(m_acquire_semaphore);

Now m_resources[new_swapchain_index] uses the semaphore that was passed to acquireNextImageKHR, and m_acquire_semaphore contains the semaphore that belonged to the returned resource, and therefore is known to be unused.