Actually, I’d say it is (kind of) an equivalent situation, as that is the only way VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT is not completely useless. There is no gl_DispatchID, so it is not possible to distinguish between different dispatches unless something like a push_constant update is done. There is also no
typedef struct VkDispatchBaseIndirectCommand {
    uint32_t baseX;
    uint32_t baseY;
    uint32_t baseZ;
    uint32_t x;
    uint32_t y;
    uint32_t z;
} VkDispatchBaseIndirectCommand;
which would allow distinguishing between different dispatches. And even if this existed (which would probably make it quite a bit more efficient for DGC), that would not change VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT being useless without push_constant updates, as it would obviously require a different token type, for example something like VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_BASE_EXT.
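To illustrate why a base would be enough to tell dispatches apart, here is a minimal CPU-only sketch. It assumes the hypothetical indirect command would follow the same semantics as the existing vkCmdDispatchBase, where the workgroup IDs seen by the shader start at the base instead of at zero. DispatchBaseCommand, WorkGroupId, and both helper functions are made up for this example and are not part of Vulkan.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical: same fields as the struct proposed above. No such Vulkan
// type exists; this only models what it could carry.
struct DispatchBaseCommand {
    uint32_t baseX, baseY, baseZ;
    uint32_t x, y, z;
};

// Stand-in for the gl_WorkGroupID value observed by one workgroup.
struct WorkGroupId {
    uint32_t x, y, z;
};

// Enumerates the workgroup IDs one command would produce, assuming
// vkCmdDispatchBase semantics: IDs cover [base, base + count) per dimension.
inline std::vector<WorkGroupId> workgroup_ids(const DispatchBaseCommand& c) {
    std::vector<WorkGroupId> ids;
    for (uint32_t gz = 0u; gz < c.z; ++gz)
        for (uint32_t gy = 0u; gy < c.y; ++gy)
            for (uint32_t gx = 0u; gx < c.x; ++gx)
                ids.push_back(WorkGroupId{c.baseX + gx, c.baseY + gy, c.baseZ + gz});
    return ids;
}

// Builds one command per instance and encodes the instance index in baseY,
// so the shader could read it back without any push constant update.
inline std::vector<DispatchBaseCommand> per_instance_commands(uint32_t first_instance,
                                                              uint32_t instance_count,
                                                              uint32_t groups_x) {
    std::vector<DispatchBaseCommand> cmds;
    cmds.reserve(instance_count);
    for (uint32_t i = 0u; i < instance_count; ++i)
        cmds.push_back(DispatchBaseCommand{0u, first_instance + i, 0u, groups_x, 1u, 1u});
    return cmds;
}
```

With this encoding, every workgroup of the third command reports a Y value of first_instance + 2, which is exactly the information a push_constant update would otherwise have to provide.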
Not being able to distinguish between calls would also make VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT useless in the sense that, if your situation does not need to distinguish between them anyway, a single vkCmdDispatchIndirect would do the trick. That would be a lot simpler, as no DGC is used; it would take a lot less VRAM, as no preprocess buffer is needed; and it would be a lot more efficient, as it is just a simple vkCmdDispatchIndirect call instead of DGC.
There is only one situation where VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT is not useless without push_constant updates, and that is if every single call needs a different compute pipeline. To me that seems unlikely to be required, unlike with fragment shaders. And even if all calls did use a different compute pipeline, the chance that (at least with the current performance) DGC is faster than just some vkCmdDispatch or vkCmdDispatchIndirect calls is not very big.
Unless there is a way to very efficiently distinguish between dispatch calls that I am missing, I see this as an equivalent test case, as VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT would otherwise be (almost completely) useless.
What happens when you change the former to behave similar to what DGC is doing with thousands of vkCmdDispatches with vkCmdPushConstant between?
Do you get similar performance?
No, I do not. With similar behavior I now literally have over 2500 separate dispatches (per frame), and I went from 0.59ms (with ~17 dispatches per frame) to 0.63ms (with >2500 dispatches per frame). DGC is just really slow.
This means I went from:
auto* data = ...;
data->instance_offset = ...;
vkCmdPushConstants(...);
vkCmdDispatch(cmd, (max_vertex_count + LOCAL_SIZE - 1) / LOCAL_SIZE, instance_count, 1u);
to
auto* data = ...;
data->instance_offset = ...;
for (uint32 i = 0u; i < instance_count; i++) {
    // one push constant update and one dispatch per instance
    vkCmdPushConstants(...);
    vkCmdDispatch(cmd, (max_vertex_count + LOCAL_SIZE - 1) / LOCAL_SIZE, 1u, 1u);
    data->instance_offset++;
}
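To make clear that the two recordings do the same work, here is a minimal CPU-only sketch that models them without touching Vulkan. It assumes the shader derives the instance as instance_offset plus gl_WorkGroupID.y, and it assumes a LOCAL_SIZE of 64; neither is stated above. RecordedDispatch and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for one recorded dispatch together with the push
// constant value that was current when it was recorded. Not a Vulkan type.
struct RecordedDispatch {
    uint32_t instance_offset;
    uint32_t group_count_x;
    uint32_t group_count_y;
};

constexpr uint32_t kLocalSize = 64u;  // assumed value of LOCAL_SIZE

// Round-up division, as in the snippets above.
inline uint32_t group_count(uint32_t max_vertex_count) {
    assert(max_vertex_count <= UINT32_MAX - (kLocalSize - 1u));  // no wrap-around
    return (max_vertex_count + kLocalSize - 1u) / kLocalSize;
}

// Variant A: a single dispatch, with the instances spread over Y.
inline std::vector<RecordedDispatch> record_batched(uint32_t first_instance,
                                                    uint32_t instance_count,
                                                    uint32_t max_vertex_count) {
    return {RecordedDispatch{first_instance, group_count(max_vertex_count), instance_count}};
}

// Variant B: one dispatch per instance, each with its own push constant value.
inline std::vector<RecordedDispatch> record_per_instance(uint32_t first_instance,
                                                         uint32_t instance_count,
                                                         uint32_t max_vertex_count) {
    std::vector<RecordedDispatch> out;
    out.reserve(instance_count);
    for (uint32_t i = 0u; i < instance_count; ++i)
        out.push_back(RecordedDispatch{first_instance + i, group_count(max_vertex_count), 1u});
    return out;
}

// Expands a recording into the set of (workgroup x, absolute instance) pairs,
// assuming the shader computes instance = instance_offset + gl_WorkGroupID.y.
inline std::set<std::pair<uint32_t, uint32_t>> covered_work(
        const std::vector<RecordedDispatch>& dispatches) {
    std::set<std::pair<uint32_t, uint32_t>> work;
    for (const RecordedDispatch& d : dispatches)
        for (uint32_t y = 0u; y < d.group_count_y; ++y)
            for (uint32_t x = 0u; x < d.group_count_x; ++x)
                work.emplace(x, d.instance_offset + y);
    return work;
}
```

Under those assumptions both variants cover the same workgroups; the only difference is the number of recorded commands, which is what the timing comparison above measures.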
This means that either Nvidia’s compute part of the driver is impressively efficient, or the compute DGC implementation is, just like the graphics one, also very slow. However, having over 2500 dispatches (all of them with a different push_constant) be far faster than DGC seems like a bug to me. With plain dispatches, the driver does not know what I may do, as these command buffers are re-recorded every frame. With DGC, however, the driver should already know what to expect: it gets a layout that tells it to update a push_constant (it even already knows the size and the offset) and then do a dispatch indirect.
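For reference, the per-sequence data such a layout describes is tiny. The following CPU-only sketch packs it into a byte stream, assuming a 4-byte push constant at offset 0 followed by the dispatch arguments at offset 4, for a stride of 16 bytes. DispatchArgs mirrors the three uint32_t members of VkDispatchIndirectCommand; DgcSequence, build_stream, and read_sequence are hypothetical names, and the offsets and stride are my assumption rather than anything taken from the test above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Mirrors the three uint32_t members of VkDispatchIndirectCommand.
struct DispatchArgs {
    uint32_t x, y, z;
};

// Hypothetical per-sequence record: push constant payload, then dispatch args.
struct DgcSequence {
    uint32_t instance_offset;  // data for the push constant token (assumed offset 0)
    DispatchArgs dispatch;     // data for the dispatch token (assumed offset 4)
};
static_assert(sizeof(DgcSequence) == 16, "assumed stride of 16 bytes");
static_assert(offsetof(DgcSequence, dispatch) == 4, "assumed dispatch offset of 4 bytes");

// Packs one sequence per instance into a tightly strided byte stream, the way
// it could be written into the indirect buffer.
inline std::vector<unsigned char> build_stream(uint32_t first_instance,
                                               uint32_t instance_count,
                                               uint32_t groups_x) {
    std::vector<unsigned char> bytes(static_cast<size_t>(instance_count) * sizeof(DgcSequence));
    for (uint32_t i = 0u; i < instance_count; ++i) {
        const DgcSequence s{first_instance + i, DispatchArgs{groups_x, 1u, 1u}};
        std::memcpy(bytes.data() + static_cast<size_t>(i) * sizeof(DgcSequence), &s, sizeof(s));
    }
    return bytes;
}

// Reads one sequence back out of the stream, for inspection.
inline DgcSequence read_sequence(const std::vector<unsigned char>& bytes, size_t index) {
    assert((index + 1u) * sizeof(DgcSequence) <= bytes.size());
    DgcSequence s;
    std::memcpy(&s, bytes.data() + index * sizeof(DgcSequence), sizeof(s));
    return s;
}
```

For 2500 sequences this comes to 40000 bytes of input, with a fixed size and fixed offsets known up front.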
The graphics test, by contrast, was at least close to the CPU alternative, although only when sorted. The non-sorted variant (only EXT, as shown in the first posts) was a lot slower, but when the state was sorted, both the EXT and NV variants were around 2.5x slower than on the CPU. This is a far smaller difference than in the compute case.
Or were you referring to the fact that the NV version of the compute DGC is faster than the EXT version? I double checked that with a VKCTS test and I think my comparison results were quite similar unless I missed something.
I was not. The sample does not contain a compute test, and I have not tested the NV version myself, as the README recommends using the EXT version going forward.