Extremely poor VK_EXT_device_generated_commands performance

Well, I hope so, given how simple this test case is:

  • update a uint with the current index (for part of a push constant) and create an indirect dispatch command, writing both into a buffer
  • this was added to a preparation compute shader that was needed even without dgc, so it is not an extra pipeline dispatch or anything like that
  • the only added code was this:
#ifdef DEVICE_GENERATED_COMMANDS
    // Round the workgroup count up so every vertex of this sub-mesh LOD is covered.
    DispatchIndirectCommand dispatch;
    dispatch.x = (sub_mesh_lod.vertex_count + LOCAL_SIZE - 1u) / LOCAL_SIZE;
    dispatch.y = 1u;
    dispatch.z = 1u;

    // One sequence per instance: the indirect dispatch plus the index that the
    // push constant token feeds to the skinning shader.
    dispatch_indirect_buffer.instances[absolute_index].command = dispatch;
    dispatch_indirect_buffer.instances[absolute_index].dispatch_index = absolute_index;
#endif
  • the only difference in the skinning compute shader was this, as the absolute_index was already computed in the previous shader and placed in the push constant:
#ifdef DEVICE_GENERATED_COMMANDS
    // One generated dispatch per instance; the index arrives via the push constant.
    const uint absolute_index = instance_index;
#else
    // One fat dispatch; the y coordinate of the invocation selects the instance.
    const uint absolute_index = instance_offset + gl_GlobalInvocationID.y;
#endif
  • execute the command layout (without an execution set, since the pipeline is not changed); a sketch of this setup follows the list
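
For reference, a minimal sketch of the kind of setup I mean (not my exact code: device, cmd, pipeline_layout, sequence_buffer_address, preprocess_address, preprocess_size and instance_count are placeholders, extension entry points are assumed already loaded, and the per-sequence struct mirrors the shader-side buffer element above):

#include <stddef.h>        // offsetof
#include <vulkan/vulkan.h>

// Mirrors one element of dispatch_indirect_buffer.instances[] (16-byte stride).
typedef struct SequenceData {
    VkDispatchIndirectCommand command; // 12 bytes at offset 0
    uint32_t dispatch_index;           // offset 12
} SequenceData;

VkIndirectCommandsPushConstantTokenEXT push_constant_token = {
    .updateRange = {
        .stageFlags = VK_SHADER_STAGE_COMPUTE_BIT,
        .offset = 0u, // wherever dispatch_index sits in the push constant block
        .size = sizeof(uint32_t),
    },
};

VkIndirectCommandsLayoutTokenEXT tokens[2] = {
    {
        .sType = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT,
        .type = VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_EXT,
        .data = { .pPushConstant = &push_constant_token },
        .offset = offsetof(SequenceData, dispatch_index),
    },
    {
        // The action token (the dispatch itself) must come last in the layout.
        .sType = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT,
        .type = VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT,
        .offset = offsetof(SequenceData, command),
    },
};

VkIndirectCommandsLayoutCreateInfoEXT layout_info = {
    .sType = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_CREATE_INFO_EXT,
    .shaderStages = VK_SHADER_STAGE_COMPUTE_BIT,
    .indirectStride = sizeof(SequenceData),
    .pipelineLayout = pipeline_layout,
    .tokenCount = 2u,
    .pTokens = tokens,
};
VkIndirectCommandsLayoutEXT commands_layout;
vkCreateIndirectCommandsLayoutEXT(device, &layout_info, NULL, &commands_layout);

// Per batch, with the skinning compute pipeline already bound and the preprocess
// buffer sized via vkGetGeneratedCommandsMemoryRequirementsEXT (elided here):
VkGeneratedCommandsInfoEXT generated_info = {
    .sType = VK_STRUCTURE_TYPE_GENERATED_COMMANDS_INFO_EXT,
    .shaderStages = VK_SHADER_STAGE_COMPUTE_BIT,
    .indirectExecutionSet = VK_NULL_HANDLE, // pipeline never changes
    .indirectCommandsLayout = commands_layout,
    .indirectAddress = sequence_buffer_address,
    .indirectAddressSize = instance_count * sizeof(SequenceData),
    .preprocessAddress = preprocess_address,
    .preprocessSize = preprocess_size,
    .maxSequenceCount = instance_count,
};
vkCmdExecuteGeneratedCommandsEXT(cmd, VK_FALSE, &generated_info);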

This took around 14-15ms in total for the couple thousand instances being skinned, whereas the wasteful method:

  • get the largest vertex count
  • execute 1 dispatch with enough workgroups for the largest vertex count, like so: vkCmdDispatch(cmd, (max_vertex_count + LOCAL_SIZE - 1) / LOCAL_SIZE, instance_count, 1u).
  • the compute shader uses gl_GlobalInvocationID.y to compute the instance index (a host-side sketch follows below).

takes around 0.59ms.
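
For contrast, the host side of the wasteful path is essentially just this per batch (a sketch; instances, instance_count, LOCAL_SIZE and cmd are stand-ins, and the #else branch of the skinning shader is assumed to exit early once gl_GlobalInvocationID.x passes its own instance's vertex count):

uint32_t max_vertex_count = 0u;
for (uint32_t i = 0u; i < instance_count; ++i) {
    if (instances[i].vertex_count > max_vertex_count) {
        max_vertex_count = instances[i].vertex_count;
    }
}
// One fat dispatch: x is sized for the worst-case mesh, y selects the instance.
// Everything past an instance's real vertex count is the wasted work.
vkCmdDispatch(cmd,
              (max_vertex_count + LOCAL_SIZE - 1u) / LOCAL_SIZE,
              instance_count,
              1u);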

I again hope this is due to an early implementation or some driver bug, or at the very least that it is not representative of the graphics part of device command generation, as those implementations are completely different, as you have said. Unless I'm doing something completely wrong, I unfortunately see no way this will improve my use case, as it is right now ~25.5x slower. I hope that the graphics part of dgc makes GPU-driven rendering with different shaders faster, although I cannot test that, as it would take a lot more effort to implement right now.

Notes:

  • This was done with triangle buffer sizes such that at most 150 instances could be skinned at once, which meant that both the wasteful method and the dgc method took 17 dispatches/executions per frame.
  • All instances used the same model; I did not see the need to put in the time and effort to test different models to see if dgc had an advantage from fewer wasted workgroups, as it was already a massive ~25.5x slower.
  • I increased the triangle buffer size to be 128 times larger; although this is more wasteful with memory, it allowed 2500 instances to be skinned at once. This made the wasteful method only very slightly faster: 0.56ms instead of 0.59ms. The dgc method, however, went from 14-15ms to ~170ms per frame.