So recently I’ve been doing some research on the somewhat dated redraw lag problem. I found some bug reports 1, and also some discussion threads here 3. The suggested (and applied) solution is to use X sync fences and GL_EXT_x11_sync_object. The stated reason seems to be that it is a “typical race condition that occurs with composite managers that don’t properly synchronize their rendering with the X server” 4 (I’m not 100% sure that thread is related).
My understanding of the problem is that the compositor updates the screen with out-of-date window content, and doesn’t redraw until the next damage event is received. Is that correct?
However, I don’t quite see how this could happen. I think it should be safe to assume that when the compositor receives the DamageNotify event, the content of the drawable associated with the damage has already been updated. Is that not the case? I am also not entirely clear on how sync fences work to solve this problem.
It would be really nice if someone with knowledge of this could clarify it for me.
No, it’s not the case, and you’ve nailed the root of the problem.
When a composite manager is using direct-rendering OpenGL, its rendering happens asynchronously to the X server’s. The X server sends DamageNotify as soon as it has written its rendering commands to the GPU’s command buffer, but it does not wait for those commands to actually execute on the GPU.
You can imagine the X server’s rendering and the compositor’s OpenGL rendering as running on two concurrent GPU threads that share memory. If the compositor’s GPU thread starts reading from the shared memory before the X server’s GPU thread has rendered to it, the compositor will read stale data.
X sync fences are a way of ensuring that the compositor’s read operation won’t start until the server’s write operations are complete.
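To make that concrete, the X11 side of a fence looks roughly like this (just a sketch of the SYNC extension fence calls; it assumes XSyncInitialize() has been called and omits error handling):

#include <X11/Xlib.h>
#include <X11/extensions/sync.h>

/* Create a fence in the untriggered state.  The drawable only selects
 * the screen the fence belongs to; compositors typically pass the
 * root or overlay window here. */
XSyncFence fence = XSyncCreateFence(dpy, root_window, False);

/* Ask the server to trigger the fence once all of the rendering it
 * queued before this request (e.g. the rendering that generated the
 * damage) has completed on the GPU. */
XSyncTriggerFence(dpy, fence);

/* The compositor must now make sure its own read of the window pixmap
 * does not start until the fence has been triggered; whether to wait
 * on the CPU (XSyncAwaitFence) or on the GPU (glWaitSync via
 * GL_EXT_x11_sync_object) is what the rest of this thread is about. */

/* Before the fence can be reused, put it back in the untriggered state. */
XSyncResetFence(dpy, fence);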
From what you said, it seems that just using an XSync{Trigger,Await}Fence pair should suffice to make sure the server’s write operations are complete. So why is EXT_x11_sync_object needed? Are XSync{Trigger,Await}Fence only guaranteed to sync non-GL render operations?
Also, not trying to blame NVIDIA or anything, but why does this issue mostly happen with NVIDIA cards? Is it because of some implementation differences? I noticed that the NVIDIA driver is the only one that implements the EXT_x11_sync_object extension 1. How do other drivers solve this synchronization problem?
XSyncTriggerFence just tells the X server to trigger the fence when its rendering is done. XSyncAwaitFence tells the X server to stop processing your client’s requests until one or more fences are triggered. Neither one affects OpenGL rendering, since that’s happening asynchronously to everything X is doing.
EXT_x11_sync_object allows an OpenGL application to import an XSync fence into OpenGL so that it has a way to make the OpenGL rendering wait until the X11 fence is triggered, thereby synchronizing the two otherwise-asynchronous rendering threads. This is important for compositors because the X server and the compositor are sharing memory in the form of window pixmaps that the compositor imports into OpenGL via EXT_texture_from_pixmap. In general, asynchronous rendering contexts (e.g. X11, OpenGL, Vulkan) that share memory require some sort of synchronization primitive. XSync fences and the GL_EXT_x11_sync_object extension provide the synchronization that’s missing from the EXT_texture_from_pixmap extension.
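As a rough sketch of what that import looks like (assuming glImportSyncEXT has been loaded via glXGetProcAddress, that GL 3.2 sync functions such as glWaitSync are available, and that fence is the XSyncFence created on the X11 side; not tested code):

#include <GL/glx.h>
#include <GL/glext.h>
#include <X11/extensions/sync.h>

/* GL_EXT_x11_sync_object entry point, loaded once at startup. */
PFNGLIMPORTSYNCEXTPROC pglImportSyncEXT =
    (PFNGLIMPORTSYNCEXTPROC) glXGetProcAddress((const GLubyte *) "glImportSyncEXT");

/* Import the X11 fence (an XID) as a GL sync object.  The sync object
 * becomes signaled when the server triggers the fence, and unsignaled
 * again when the fence is reset. */
GLsync sync = pglImportSyncEXT(GL_SYNC_X11_FENCE_EXT, (GLintptr) fence, 0);

/* Per frame: have the server trigger the fence after the rendering
 * that caused the damage, then make the GL command stream wait for it
 * before sampling the texture bound via EXT_texture_from_pixmap. */
XSyncTriggerFence(dpy, fence);
glWaitSync(sync, 0, GL_TIMEOUT_IGNORED);
/* ... draw calls that read the window texture go here ... */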
This is generally more of a problem with the NVIDIA driver because it allows rendering threads to run asynchronously. From what I understand, other drivers often serialize rendering in the kernel so that everything runs on the GPU in the order that it was submitted from the CPU, which is why this isn’t a problem on those drivers.
Just a few more questions before I stop bothering you :-)
It sounds like I really just need one XSync fence, import that into OpenGL, and use that to sync rendering, right? But this mutter patch 1 uses a ring of ~10 sync fences; why is that? I feel there might be a race between XSyncResetFence and glWaitSync, but I can’t come up with a practical scenario where this could happen.
Continuing from the above question, what is the best way to use the fences, and why?
If I understand things correctly, the following pseudo code should work:
fence = XSyncCreateFence(...)
XSyncTriggerFence(fence)
XSyncAwaitFence(fence) // <- X request processing paused for this client
TrivialXRequestWithReply(...) // <- This should block until request processing resumes
// which means the fence should have been triggered when we get here
// Start gl rendering here
Sorry this stuff is so confusing. It’s definitely hard to reason about, but it helps to keep in mind that there are four concurrent threads here: a compositor CPU thread, a compositor GPU thread, an X server CPU thread, and an X server GPU thread.
The ring of sync fences helps for performance. It’s fairly expensive to create a fence and share it with OpenGL, so the fences are created up-front and reused. The problem with reuse is another race: The OpenGL client needs to wait until the GPU has consumed the fence wait command before sending an XSyncResetFence request to the X server, and then the client needs to wait for an XSync alarm to indicate that the reset is complete before it inserts another wait for that fence into the OpenGL command stream. If only one fence were used, this would serialize the compositor’s CPU thread, the compositor’s GPU thread, the X server’s CPU thread, and the X server’s GPU thread. Having a ring of fences allows triggering, consuming, and resetting of fences to all be in flight at the same time.
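A stripped-down sketch of such a ring, with hypothetical names and without the bookkeeping that tracks when each fence has actually been consumed and reset (reusing dpy and pglImportSyncEXT from the earlier snippet):

#define NUM_FENCES 10

typedef struct {
    XSyncFence fence;   /* X11 fence shared with the server   */
    GLsync     sync;    /* imported GL view of the same fence */
} FencePair;

static FencePair ring[NUM_FENCES];
static int       current;

/* Done once at startup: creating and importing fences is the
 * expensive part, so do it up front and reuse them. */
static void setup_fence_ring(Display *dpy, Window root)
{
    for (int i = 0; i < NUM_FENCES; i++) {
        ring[i].fence = XSyncCreateFence(dpy, root, False);
        ring[i].sync  = pglImportSyncEXT(GL_SYNC_X11_FENCE_EXT,
                                         (GLintptr) ring[i].fence, 0);
    }
}

/* Done per frame: trigger the next fence in the ring and make the GL
 * command stream wait for it before reading any window pixmaps. */
static void begin_frame(Display *dpy)
{
    FencePair *f = &ring[current];
    XSyncTriggerFence(dpy, f->fence);
    glWaitSync(f->sync, 0, GL_TIMEOUT_IGNORED);
    current = (current + 1) % NUM_FENCES;

    /* Once the GPU is known to have consumed that wait (e.g. a
     * glFenceSync inserted after it has signaled), the fence can be
     * XSyncResetFence()d, and once the reset is known to be complete
     * the slot can be reused.  With several slots, triggering,
     * consuming, and resetting of different fences all stay in
     * flight at the same time. */
}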
I think James’s Mutter patches are a good demonstration of how they should be used, but feel free to ask for clarification if there’s something confusing about them.
The problem with this pseudocode is that XSyncTriggerFence() doesn’t wait for the fence to be triggered, it just writes the trigger command to the GPU’s command stream. TrivialXRequestWithReply() will return as soon as the X server’s CPU thread has processed the trivial request, but it does not wait until rendering queued to the GPU by earlier requests is complete. So there is no guarantee that the GPU has triggered the fence before the GL rendering starts.
One option would be to wait on the CPU for the fence to be triggered, but that’s slower than necessary. Using glWaitSync() allows the compositor to continue processing on the CPU while the GPU waits for the XSync fence to be triggered.
Oh sorry, I missed the XSyncAwaitFence() call in #3. That will indeed block the client until the fence is triggered, but the X server will continue processing requests for other clients. That puts the compositor’s CPU thread well behind where it could be, causing stutter and sluggishness. Using glWaitSync() instead of XSyncAwaitFence() blocks the GPU thread instead of the CPU thread, allowing the compositor to continue processing more Damage events.
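In code, the difference is just which wait gets issued after the trigger (a sketch, reusing the fence and the imported sync from the earlier snippets):

XSyncTriggerFence(dpy, fence);

/* Option A: block the compositor's CPU thread.  The server stops
 * processing this client's requests until the fence triggers, so the
 * round trip below does not return before then. */
XSyncAwaitFence(dpy, &fence, 1);
XSync(dpy, False);

/* Option B: keep the CPU thread running and make only the GL command
 * stream (the compositor's GPU thread) wait for the fence. */
glWaitSync(sync, 0, GL_TIMEOUT_IGNORED);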
I think this is a cost-effectiveness thing. EXT_x11_sync_object is only supported by NVIDIA, and it requires a couple hundred lines of code to use optimally. Given that, I think it would be better for me to implement that pseudo-code in a handful of lines and see how sluggish it can actually be.
The performance difference between using a ring vs using a single fence probably won’t be very noticeable for simple compositors like compton, right?
Thanks, I’m reading through the code carefully right now.