Hi!
I’m investigating a curious issue on Linux when using Vulkan timeline semaphores.
At one point in our pipeline, we run the following workflow many times:
for i = 1 to 40:
    CommandRecord();
    Submit(waitForTimeline: i, signal: i+1); // Submits the commands and signals i+1 once they finish
    vkWaitSemaphores(waitForTimeline: i+1, timeout=UINT64_MAX); // Blocks on the CPU until the GPU commands finish
(Multiple threads do similar things, but the timeline index is properly synchronized across threads.)
The GPU workload usually runs for ~200 microseconds; however, in some seemingly random cases it takes almost exactly 10 milliseconds.
In Nsight, I can see that the 10 milliseconds consist of about 200 microseconds of GPU work, after which there is absolutely no workload on either the GPU or the CPU.
The interesting thing is that if I replace the vkWaitSemaphores() call with the following, logically equivalent loop, this 10 ms bubble disappears completely and performance is good again:
while (vkWaitSemaphores(waitForTimeline: i+1, timeout=5000) == VK_TIMEOUT) {}
This busy-waits by polling the timeline semaphore (the timeout is in nanoseconds, so each poll waits at most 5 microseconds), and for some reason it works fine.
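
As written, the polling loop above also spins forever if vkWaitSemaphores returns an error such as VK_ERROR_DEVICE_LOST, since only VK_TIMEOUT is checked. Below is a minimal sketch of a bounded variant. To keep it compilable without the Vulkan headers, `WaitResult` and `pollTimeline` are hypothetical names, and `waitOnce` is assumed to be a small wrapper that calls vkWaitSemaphores with the given timeout in nanoseconds and maps VK_SUCCESS, VK_TIMEOUT and any other result to the three enum values.

```cpp
#include <chrono>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for the relevant VkResult values, so the sketch
// builds without Vulkan headers.
enum class WaitResult { Success, Timeout, Error };

// Polls in short slices until the semaphore is signaled, an error occurs,
// or the overall budget runs out.
// waitOnce(timeoutNs) is assumed to wrap vkWaitSemaphores for the timeline
// value of interest and translate its VkResult into WaitResult.
inline WaitResult pollTimeline(
    const std::function<WaitResult(std::uint64_t)>& waitOnce,
    std::uint64_t sliceNs,
    std::chrono::nanoseconds budget)
{
    const auto deadline = std::chrono::steady_clock::now() + budget;
    for (;;) {
        const WaitResult r = waitOnce(sliceNs);
        if (r != WaitResult::Timeout) {
            return r;  // signaled, or a real error: stop polling either way
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return WaitResult::Timeout;  // overall budget exhausted
        }
    }
}
```

Calling it with a slice of 5000 and a generous budget reproduces the loop above, while still returning if the device is lost or the GPU never signals.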
The driver I use is 535.171.04, but I have reports of this on the latest drivers as well.
This issue is not present on Windows.
Here is an image from Nsight showing the 10 ms of no workload: