Well I hope so, since this is a simple test case:
- update a `uint` with the current index (as part of a `push_constant`) and create an indirect dispatch command, putting it in a buffer. This was added in a preparation compute shader that was also needed without DGC, so it is not an entire extra pipeline dispatch or anything like that.
- the extra added code was just this:
```glsl
#ifdef DEVICE_GENERATED_COMMANDS
// One indirect dispatch per instance: enough workgroups to cover this
// sub-mesh's vertex count (ceiling division by the workgroup size).
DispatchIndirectCommand dispatch;
dispatch.x = (sub_mesh_lod.vertex_count + LOCAL_SIZE - 1u) / LOCAL_SIZE;
dispatch.y = 1u;
dispatch.z = 1u;
dispatch_indirect_buffer.instances[absolute_index].command = dispatch;
dispatch_indirect_buffer.instances[absolute_index].dispatch_index = absolute_index;
#endif
```
- the difference in the skinning compute shader was only this, as the `absolute_index` was already computed in the previous shader and put in the `push_constant`:
```glsl
#ifdef DEVICE_GENERATED_COMMANDS
// The device-generated push constant already carries the absolute index.
const uint absolute_index = instance_index;
#else
// Without DGC, the instance comes from the dispatch's Y dimension.
const uint absolute_index = instance_offset + gl_GlobalInvocationID.y;
#endif
```
- execute the command layout (without an execution set, since the pipeline is not changed); a host-side sketch follows this list
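
To make the setup concrete, here is a minimal sketch of what that host side can look like with `VK_EXT_device_generated_commands`. It is not the exact code used for the measurements: the names (`SequenceData`, `execute_skinning_dgc`, the parameters) and the push-constant offset of `0` are assumptions for illustration, and preprocess-buffer sizing and error handling are omitted:

```c
#include <stddef.h>          /* offsetof */
#include <vulkan/vulkan.h>

/* Per-sequence stream, matching what the preparation shader writes:
 * the VkDispatchIndirectCommand first, then the uint used as push constant. */
typedef struct SequenceData {
    VkDispatchIndirectCommand command;        /* .command in the shader        */
    uint32_t                  dispatch_index; /* .dispatch_index in the shader */
} SequenceData;

void execute_skinning_dgc(VkDevice device, VkCommandBuffer cmd,
                          VkPipelineLayout pipeline_layout, VkPipeline pipeline,
                          VkDeviceAddress indirect_address,   /* buffer the prep shader filled */
                          VkDeviceAddress preprocess_address, /* sized via vkGetGeneratedCommandsMemoryRequirementsEXT */
                          VkDeviceSize preprocess_size,
                          uint32_t instance_count)
{
    /* Token 0: write 4 bytes into the push constants (the instance index). */
    VkIndirectCommandsPushConstantTokenEXT push_token = {
        .updateRange = { VK_SHADER_STAGE_COMPUTE_BIT, 0, sizeof(uint32_t) },
    };
    VkIndirectCommandsLayoutTokenEXT tokens[2] = {
        {
            .sType  = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT,
            .type   = VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_EXT,
            .data   = { .pPushConstant = &push_token },
            .offset = offsetof(SequenceData, dispatch_index),
        },
        {   /* the action token (the dispatch itself) must come last */
            .sType  = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT,
            .type   = VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT,
            .offset = offsetof(SequenceData, command),
        },
    };
    VkIndirectCommandsLayoutCreateInfoEXT layout_info = {
        .sType          = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_CREATE_INFO_EXT,
        .shaderStages   = VK_SHADER_STAGE_COMPUTE_BIT,
        .indirectStride = sizeof(SequenceData),
        .pipelineLayout = pipeline_layout,
        .tokenCount     = 2,
        .pTokens        = tokens,
    };
    VkIndirectCommandsLayoutEXT commands_layout;
    /* Created once in practice; the EXT entry points come from vkGetDeviceProcAddr. */
    vkCreateIndirectCommandsLayoutEXT(device, &layout_info, NULL, &commands_layout);

    /* No execution set: the pipeline stays fixed, so it is bound as usual and
     * also passed through the pNext chain. */
    VkGeneratedCommandsPipelineInfoEXT pipeline_info = {
        .sType    = VK_STRUCTURE_TYPE_GENERATED_COMMANDS_PIPELINE_INFO_EXT,
        .pipeline = pipeline,
    };
    VkGeneratedCommandsInfoEXT exec_info = {
        .sType                  = VK_STRUCTURE_TYPE_GENERATED_COMMANDS_INFO_EXT,
        .pNext                  = &pipeline_info,
        .shaderStages           = VK_SHADER_STAGE_COMPUTE_BIT,
        .indirectExecutionSet   = VK_NULL_HANDLE,
        .indirectCommandsLayout = commands_layout,
        .indirectAddress        = indirect_address,
        .indirectAddressSize    = instance_count * sizeof(SequenceData),
        .preprocessAddress      = preprocess_address,
        .preprocessSize         = preprocess_size,
        .maxSequenceCount       = instance_count,
    };
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
    vkCmdExecuteGeneratedCommandsEXT(cmd, VK_FALSE, &exec_info);
}
```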
This took around `14-15ms` for a total of a couple thousand instances being skinned. Whereas the wasteful method:
- get the largest vertex count
- execute one dispatch with enough workgroups for the largest vertex count, like so: `vkCmdDispatch(cmd, (max_vertex_count + LOCAL_SIZE - 1) / LOCAL_SIZE, instance_count, 1u)`
- the compute shader uses `gl_GlobalInvocationID.y` to compute the instance index

takes around `0.59ms` (sketched below).
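
For contrast, the host side of the wasteful path is roughly this (again a sketch; `Instance`, `instances`, and `local_size` are names assumed for illustration, with `local_size` matching `LOCAL_SIZE` in the shader):

```c
#include <vulkan/vulkan.h>

/* Hypothetical per-instance data; only the vertex count matters here. */
typedef struct Instance { uint32_t vertex_count; } Instance;

void dispatch_skinning_wasteful(VkCommandBuffer cmd, const Instance* instances,
                                uint32_t instance_count, uint32_t local_size)
{
    /* One dispatch sized for the worst case: X covers the largest mesh's
     * vertices, Y selects the instance. */
    uint32_t max_vertex_count = 0;
    for (uint32_t i = 0; i < instance_count; ++i) {
        if (instances[i].vertex_count > max_vertex_count)
            max_vertex_count = instances[i].vertex_count;
    }
    /* Threads whose gl_GlobalInvocationID.x lands past their own instance's
     * vertex count early-out in the shader; that is the wasted work. */
    vkCmdDispatch(cmd, (max_vertex_count + local_size - 1u) / local_size,
                  instance_count, 1u);
}
```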
I again hope this is due to an early implementation or some driver bug, or at the very least that it is not representative of the graphics part of device-generated commands, as those implementations are completely different, as you have said. Unless I'm doing something completely wrong, I unfortunately see no way this will improve my use case, as it is right now `~25.5x` slower. I hope that the graphics part of DGC makes GPU-driven rendering with different shaders faster, although I cannot test that, as it would take a lot more effort to implement right now.
Note:
- This was done with triangle buffer sizes such that at most `150` instances could be skinned at once; this meant that both the wasteful method and the DGC method took `17` dispatches/executions per frame.
- All instances used the same model, as I did not see the need to put in the time and effort to use different models to check whether DGC had an advantage from fewer wasted workgroups, as it was already a massive `~25.5x` slower.
- I increased the triangle buffer size to be `128` times larger; although more wasteful with memory, this allowed `2500` instances to be skinned at once. This made the wasteful method only very slightly faster: `0.56ms` instead of `0.59ms`. However, the DGC method went from `14-15ms` to `~170ms` per frame.
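
The batch counts in the first note are just a ceiling division over the numbers quoted above; as a quick check (with my reading that the enlarged buffer fits all instances in a single batch):

```c
/* ~2500 instances at up to 150 per batch: */
uint32_t batches = (2500u + 150u - 1u) / 150u; /* = 17 dispatches/executions per frame */
/* With the 128x larger buffer, all ~2500 fit at once: a single batch per frame. */
```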