Extremely poor VK_EXT_device_generated_commands performance

Well, I hope so, given how simple this test case is:

  • update a uint with the current index (for part of a push constant) and create an indirect dispatch command, writing both into a buffer
  • this was added to a preparation compute shader that was needed even without dgc, so it is not an extra pipeline dispatch or anything like that
  • the only added code was this:
#ifdef DEVICE_GENERATED_COMMANDS
    // Round the workgroup count up so every vertex of this sub-mesh LOD is covered.
    DispatchIndirectCommand dispatch;
    dispatch.x = (sub_mesh_lod.vertex_count + LOCAL_SIZE - 1u) / LOCAL_SIZE;
    dispatch.y = 1u;
    dispatch.z = 1u;

    // One sequence per instance: the indirect dispatch plus the index that the
    // push constant token feeds to the skinning shader.
    dispatch_indirect_buffer.instances[absolute_index].command = dispatch;
    dispatch_indirect_buffer.instances[absolute_index].dispatch_index = absolute_index;
#endif
  • the only difference in the skinning compute shader was this, as the absolute_index was already computed in the previous shader and placed in the push constant:
#ifdef DEVICE_GENERATED_COMMANDS
    // One generated dispatch per instance; the index arrives via the push constant.
    const uint absolute_index = instance_index;
#else
    // One fat dispatch; the y coordinate of the invocation selects the instance.
    const uint absolute_index = instance_offset + gl_GlobalInvocationID.y;
#endif
  • execute the command layout (without an execution set, since the pipeline is not changed); a sketch of this setup follows the list
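
For reference, a minimal sketch of the kind of setup I mean (not my exact code: device, cmd, pipeline_layout, sequence_buffer_address, preprocess_address, preprocess_size and instance_count are placeholders, extension entry points are assumed already loaded, and the per-sequence struct mirrors the shader-side buffer element above):

#include <stddef.h>        // offsetof
#include <vulkan/vulkan.h>

// Mirrors one element of dispatch_indirect_buffer.instances[] (16-byte stride).
typedef struct SequenceData {
    VkDispatchIndirectCommand command; // 12 bytes at offset 0
    uint32_t dispatch_index;           // offset 12
} SequenceData;

VkIndirectCommandsPushConstantTokenEXT push_constant_token = {
    .updateRange = {
        .stageFlags = VK_SHADER_STAGE_COMPUTE_BIT,
        .offset = 0u, // wherever dispatch_index sits in the push constant block
        .size = sizeof(uint32_t),
    },
};

VkIndirectCommandsLayoutTokenEXT tokens[2] = {
    {
        .sType = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT,
        .type = VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_EXT,
        .data = { .pPushConstant = &push_constant_token },
        .offset = offsetof(SequenceData, dispatch_index),
    },
    {
        // The action token (the dispatch itself) must come last in the layout.
        .sType = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT,
        .type = VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT,
        .offset = offsetof(SequenceData, command),
    },
};

VkIndirectCommandsLayoutCreateInfoEXT layout_info = {
    .sType = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_CREATE_INFO_EXT,
    .shaderStages = VK_SHADER_STAGE_COMPUTE_BIT,
    .indirectStride = sizeof(SequenceData),
    .pipelineLayout = pipeline_layout,
    .tokenCount = 2u,
    .pTokens = tokens,
};
VkIndirectCommandsLayoutEXT commands_layout;
vkCreateIndirectCommandsLayoutEXT(device, &layout_info, NULL, &commands_layout);

// Per batch, with the skinning compute pipeline already bound and the preprocess
// buffer sized via vkGetGeneratedCommandsMemoryRequirementsEXT (elided here):
VkGeneratedCommandsInfoEXT generated_info = {
    .sType = VK_STRUCTURE_TYPE_GENERATED_COMMANDS_INFO_EXT,
    .shaderStages = VK_SHADER_STAGE_COMPUTE_BIT,
    .indirectExecutionSet = VK_NULL_HANDLE, // pipeline never changes
    .indirectCommandsLayout = commands_layout,
    .indirectAddress = sequence_buffer_address,
    .indirectAddressSize = instance_count * sizeof(SequenceData),
    .preprocessAddress = preprocess_address,
    .preprocessSize = preprocess_size,
    .maxSequenceCount = instance_count,
};
vkCmdExecuteGeneratedCommandsEXT(cmd, VK_FALSE, &generated_info);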

This took around 14-15ms in total for the couple thousand instances being skinned, whereas the wasteful method:

  • get the largest vertex count
  • execute 1 dispatch with enough workgroups for the largest vertex count, like so: vkCmdDispatch(cmd, (max_vertex_count + LOCAL_SIZE - 1) / LOCAL_SIZE, instance_count, 1u).
  • the compute shader uses gl_GlobalInvocationID.y to compute the instance index (a host-side sketch follows below).

takes around 0.59ms.
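
For contrast, the host side of the wasteful path is essentially just this per batch (a sketch; instances, instance_count, LOCAL_SIZE and cmd are stand-ins, and the #else branch of the skinning shader is assumed to exit early once gl_GlobalInvocationID.x passes its own instance's vertex count):

uint32_t max_vertex_count = 0u;
for (uint32_t i = 0u; i < instance_count; ++i) {
    if (instances[i].vertex_count > max_vertex_count) {
        max_vertex_count = instances[i].vertex_count;
    }
}
// One fat dispatch: x is sized for the worst-case mesh, y selects the instance.
// Everything past an instance's real vertex count is the wasted work.
vkCmdDispatch(cmd,
              (max_vertex_count + LOCAL_SIZE - 1u) / LOCAL_SIZE,
              instance_count,
              1u);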

I again hope this is due to an early implementation or some driver bug, or at the very least that it is not representative of the graphics part of device command generation, as those implementations are completely different, as you have said. Unless I'm doing something completely wrong, I unfortunately see no way this will improve my use case, as it is right now ~25.5x slower. I hope that the graphics part of dgc makes GPU-driven rendering with different shaders faster, although I cannot test that, as it would take a lot more effort to implement right now.

Notes:

  • This was done with triangle buffer sizes such that at most 150 instances could be skinned at once, which meant that both the wasteful method and the dgc method took 17 dispatches/executions per frame.
  • All instances used the same model; I did not see the need to put in the time and effort to test different models to see if dgc had an advantage from fewer wasted workgroups, as it was already a massive ~25.5x slower.
  • I increased the triangle buffer size to be 128 times larger; although this is more wasteful with memory, it allowed 2500 instances to be skinned at once. This made the wasteful method only very slightly faster: 0.56ms instead of 0.59ms. The dgc method, however, went from 14-15ms to ~170ms per frame.