Vulkan GPU Markers seem off

Hi,
I am trying to understand the Vulkan GPU Markers visualization. As an example I will use Falcor's WhittedRayTracer, which executes a GBuffer-Pass followed by the WhittedRayTracer-Pass.
The CPU-Markers are as I would expect them.


But I do not get why the GPU Markers look this way.


It seems that the WhittedRayTracer-Pass is running alongside the GBuffer-Pass, but only for a very short duration.
I am of course debugging a different application, but the marker placement is always similar to this simple example.
I ran this with the latest drivers and the latest Nsight Systems on Windows and Linux.

@dofek

The GPU workload marker ranges correspond to the execution timings of the individual commands inside the workloads (command buffers). They are produced by inserting timestamp queries (or a driver-level equivalent, in the case of Vulkan applications) into the command buffers and reading their output when the workload finishes executing.
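To make the idea concrete, here is a minimal sketch of how a begin/end timestamp pair could be turned into a marker range and checked for overlap with another range. This is an illustration only, not how Nsight Systems or the driver implements it. The names MarkerRange, makeRange and overlaps are hypothetical, and nsPerTick is assumed to play the role of VkPhysicalDeviceLimits::timestampPeriod (nanoseconds per timestamp tick).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// One marker range reconstructed from a begin/end timestamp pair.
// (Hypothetical type, for illustration only.)
struct MarkerRange {
    std::string name;
    double startNs;  // start time relative to the base tick, in nanoseconds
    double endNs;    // end time relative to the base tick, in nanoseconds
    double durationNs() const { return endNs - startNs; }
};

// Convert raw timestamp ticks into a range in nanoseconds.
// nsPerTick is assumed to correspond to
// VkPhysicalDeviceLimits::timestampPeriod.
MarkerRange makeRange(const std::string& name, std::uint64_t beginTick,
                      std::uint64_t endTick, std::uint64_t baseTick,
                      double nsPerTick) {
    // Ticks are expected to be monotonically increasing within a workload.
    assert(beginTick >= baseTick && endTick >= beginTick);
    return MarkerRange{name,
                       static_cast<double>(beginTick - baseTick) * nsPerTick,
                       static_cast<double>(endTick - baseTick) * nsPerTick};
}

// True if the two ranges share any span of time, i.e. the timestamps
// indicate that the two markers were executing concurrently.
bool overlaps(const MarkerRange& a, const MarkerRange& b) {
    return a.startNs < b.endNs && b.startNs < a.endNs;
}
```

If the begin timestamp of the second marker is smaller than the end timestamp of the first, overlaps returns true and the two ranges would be drawn side by side on the timeline.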

Some operations, like raster rendering (e.g. vkCmdDraw), will always execute in a sequential manner since the GPU only has a single graphics pipeline and a single graphics hardware queue.

Some operations, such as copy operations or compute operations, can sometimes be parallelized so that they run on the asynchronous compute and copy queues in the hardware. This may happen even if the workload, at the graphics API level, was placed inside a graphics / direct queue.

Ray-tracing operations can run on the async compute queues - so in this case, the GPU’s internal scheduler decided that the ray tracing workload inside the WhittedRayTracer marker can be executed without waiting for the previous marker to finish, and ran it in parallel. Apparently, it finished executing faster, so its end also came before the first marker ended.

If there is a resource dependency where this should not be the case (i.e. the ray tracing workload depends on operations performed inside the first marker), perhaps your application is missing resource barriers between the two parts of the command buffer.

Another option could be to split the operation into two command buffers and set a fence object to synchronize between them - if you want to make sure the entirety of the first operation ended before the second one begins.

Speaking more broadly, if you are unsure how certain execution patterns came to be, another helpful ability of the tool is to show the Windows driver queues by activating WDDM trace. While this information is not aware of the Vulkan debug utils markers, you can select the main “GPU Workload” (green-colored) bar in the queue’s row and it will be correlated to the WDDM events that show it going through the scheduling and execution pipeline. The second option I mentioned before (using two command buffers with or without a fence) will show this even more clearly since they will be two separate workloads in that case.

Hope this helped you understand what is going on here and feel free to ask any follow-up questions if anything is still unclear.

Regards

Thank you for your detailed answer!

I do not think the WhittedRayTracer can run before the GBuffer has finished.
There are barriers in place between all major passes. I also checked timings in Nsight Graphics, which can be seen in the following screenshot:


The GBuffer and WhittedRayTracer show about the same execution times, which is very different from what Nsight Systems shows.
The barriers between passes are also visible there. As mentioned before, I picked Nvidia Falcor examples. There can of course be flaws in them, but I view them as stable examples.
The following screenshot shows Nsight Systems again and the arrow shows where the Tonemapper is.

This can of course never happen before the GBuffer has finished.

I have tried multiple programs and have seen similar results.
This kind of behavior can be observed inside a single command buffer.
The first pass seems to take as long as all passes combined, while the remaining passes show only a very short duration (even if in reality they take much longer than the first pass) and appear to happen during the first pass.

Can my recording settings be that wrong to produce such results?