Extremely poor VK_EXT_device_generated_commands performance

I tried the nvpro-sample for device generated commands. When I use the same settings for the performance metric with a 3080 (sorting OFF), I get:

for re-used cmds:
draw: 14 ms

for threaded cmds:
draw: 15 ms

for generated ext:
draw: 90 ms

for preprocess, generated ext:
preprocess: 15 ms
draw: 72 ms

However, with the NV version of the extension (so pushaddress instead of inst.vertexattrib index):

for generated nv:
draw: 7.3 ms

for preprocess, generated nv:
preprocess: 0.2 ms
draw: 7.1 ms

I can only get reasonable results from the EXT version if I use binning and sorting, which makes the extension largely useless (at least for drawing, not for computing) as it would require sorting on the CPU:

for generated ext:
draw: 5.7 ms

for preprocess, generated ext:
preprocess: 0.035 ms
draw: 5.4 ms

However, the NV version is still a lot faster:

for generated nv:
draw: 2.2 ms

for preprocess, generated nv:
preprocess: 0.197 ms
draw: 1.9 ms

I also noticed a HUGE speedup when using shaderobj instead of pipeline:

for generated ext:
draw: 90 ms → 34 ms

for preprocess, generated ext:
preprocess: 15 ms → 4.3 ms
draw: 72 ms → 30 ms

although I’d much rather keep using pipelines.

The sample readme states:

At the time of writing the EXT_device_generated_commands implementation was new, future improvements to performance and preprocess memory may happen.

and

Use EXT_device_generated_commands going forward

so I assume, and hope, especially since the extensions are similar, that this is just a bug or due to the early implementation, and that it will be fixed in the near future.

Hi @weteringt those performance numbers are in line with what is expected and described in the README.

We do mention in the readme that the benchmark is an artificial stress test for tiny draw calls, which is typically not a real-world scenario. The ability to do GPU-driven rendering through occlusion culling etc. enables much bigger gains in performance; that is where these extensions shine. It’s not about doing the same work just on the GPU. And at that stage, doing the binning of draw calls on the GPU as well wouldn’t require the CPU to sort at all.

For now we decided to keep the sample more basic and not add all these other effects; however, your feedback suggests that we should add more complexity to show the benefits in the bigger picture.

So far I can recommend having a look at samples like vk_lod_clusters or gl_occlusion_culling, which showcase how indirect drawing can be leveraged to speed things up by calculating draws on the GPU.

I understand that this is a stress test. I myself have been using a GPU-driven pipeline for a long time, and I have long hoped for something like device generated commands to improve it. What I have had so far is:

  • sort on the CPU to have as few pipeline binds as possible
  • put the instances in instance buffers
  • use a compute shader to build a draw-call buffer (doing culling first)
  • then draw indirect (a minimal sketch of this flow follows the list)
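
For context, a rough sketch of the tail end of that flow, assuming the culling compute pass writes a VkDrawIndirectCommand array plus a uint32_t draw count into GPU buffers (cullPipeline, drawBuffer, countBuffer, instanceCount, maxDrawCount are illustrative names, not my actual engine code):

// Culling pass: one thread per instance writes (or skips) a draw record
// into drawBuffer and bumps the count in countBuffer.
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, cullPipeline);
vkCmdDispatch(cmd, (instanceCount + 63u) / 64u, 1u, 1u);

// ... barrier: compute writes -> indirect command read ...

// Draw pass: a single indirect call consumes whatever the GPU produced.
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, drawPipeline);
vkCmdDrawIndirectCount(cmd,
                       drawBuffer, 0,                  // VkDrawIndirectCommand array
                       countBuffer, 0,                 // uint32_t written by the culling pass
                       maxDrawCount,                   // upper bound on generated draws
                       sizeof(VkDrawIndirectCommand));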

I even use custom vertex fetch + buffer device addresses to be able to use a different index buffer + vertex buffer per instance, so as long as the shader is the same and not too many different materials are used (max 512 or something), it can be done in 1 draw call, even if every instance has a different index and/or vertex buffer.
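
Each GPU-written draw record then ends up looking roughly like the struct below (an illustrative layout only, not my engine’s actual one); the vertex shader pulls indexBuffer/vertexBuffer through buffer device addresses using the record index passed in firstInstance plus gl_VertexIndex:

// Hypothetical per-instance draw record for custom vertex fetch.
struct GpuDrawRecord {
    VkDrawIndirectCommand draw;          // vertexCount = index count, firstInstance = record index
    VkDeviceAddress       indexBuffer;   // read manually in the vertex shader
    VkDeviceAddress       vertexBuffer;  // per-instance vertex data
    uint32_t              materialIndex; // capped (e.g. 512) so one pipeline covers all of them
};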

The only things I was missing were being able to choose the pipeline on the GPU so it could be 1 draw call, and being able to put image views into buffers so no descriptor is needed. The pipeline part of this has been mostly solved with device generated commands, although it seems that the execution set has to be updated on the CPU first when changing its pipelines. While not completely desirable (as opposed to just a buffer which the GPU could fill too), it is enough to fix that problem.
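
For reference, the CPU-side execution-set update I mean looks roughly like this (per my reading of the VK_EXT_device_generated_commands spec; slot, newPipeline and the execution set handle are placeholders, so verify the struct/function names against the headers):

VkWriteIndirectExecutionSetPipelineEXT write{
    VK_STRUCTURE_TYPE_WRITE_INDIRECT_EXECUTION_SET_PIPELINE_EXT};
write.index    = slot;          // which entry of the execution set to overwrite
write.pipeline = newPipeline;   // must be compatible with the set's initial pipeline

vkUpdateIndirectExecutionSetPipelineEXT(device, indirectExecutionSet, 1u, &write);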

The problem I had with the sample, as mentioned before, is not necessarily that the performance is very poor, but that the almost identical NVIDIA-specific extension, which was used as a base for the EXT version, has far better performance with the same setup. So what I thought is that, even though it is an unrealistic stress test, the performance of the EXT and NV versions should be basically the same, if not better for EXT, since potential mistakes could have been learned from and improved upon. However, I’d say that such a huge performance disparity, where the NV version with the same inputs is more than 10x faster, seems like a bug or an oversight.

Hi @weteringt those performance numbers are in line with what is expected and described in the README.

Not entirely, right? In the README, the EXT version is shown as being at most 50% slower (maybe aside from the binning case, since no numbers are given for the NV version there). However, with the same setup and settings I get the NV version being slightly more than 10x faster, and if I use pushaddress for the NV version, it is around 11x faster.

To me, this huge performance difference looked more like a bug than the poor performance itself did. I may not have explained that well enough in my first post.

And at that stage, doing the binning of draw calls on the GPU as well wouldn’t require the CPU to sort at all.

That’s true; I just entirely forgot about GPU sorting. An updated sample where an efficient GPU draw-call sorting algorithm is implemented would be nice, especially as state bucketing/binning is recommended in the README.

And by the way, it looks like VkGeneratedCommandsInfoEXT::shaderStages is never filled (neither for executing nor for preprocessing) and is always kept as 0. The docs say:

shaderStages is the mask of shader stages used by the commands.

However, the sample obviously still works. Is this a bug? Does Nvidia not require it (even though it should still be filled for other vendors, as required by the spec)? Or is the documentation lacking and is it only required in specific situations?

Edit: I just found this in the docs:

VUID-VkGeneratedCommandsInfoEXT-shaderStages-requiredbitmask
shaderStages must not be 0

So this is most likely a bug in the sample, I guess.
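
Presumably the fix is just filling the field when building the struct, something along these lines (a sketch based on the spec, not the sample’s actual code):

VkGeneratedCommandsInfoEXT info{VK_STRUCTURE_TYPE_GENERATED_COMMANDS_INFO_EXT};
// Must cover every stage the generated commands use; leaving it 0 violates
// VUID-VkGeneratedCommandsInfoEXT-shaderStages-requiredbitmask.
info.shaderStages = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT;
// ... remaining fields as the sample already sets them ...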

Thanks for all your feedback. I should have phrased it differently: some of the performance differences between EXT (mutable IndirectExecutionSet) and NV (immutable Pipeline with fixed Shader Groups), as well as their key causes, are known to us, and the statement about possible future improvements is still applicable. We were looking for real-world data to drive those optimizations, rather than basing them purely on this stress test. That’s why I was curious to hear whether you had already tried applying this technology in the GPU-driven culling way and seen a big difference there as well.

Thanks, we will address the bug in the sample (it doesn’t make a difference for us).

We will also investigate whether there is an unknown performance issue in the implementation, given that the numbers you reported are indeed higher than expected. I missed comparing them with the CPU-generated draw calls; thanks for providing those numbers.

We were looking for real-world data to drive those optimizations, rather than basing them purely on this stress test. That’s why I was curious to hear whether you had already tried applying this technology in the GPU-driven culling way and seen a big difference there as well.

I am working on implementing this extension. Implementing it for my rendering pipeline will be a bit too much work for now, although not as much work as rewriting a non-GPU-driven pipeline, of course. What I am using it for initially is compute-based skinning. Right now, to prevent huge amounts of dispatch calls, I skin different meshes in 1 call, given they all lie within a certain index-count threshold. However, this causes huge amounts of workgroups/subgroups to be idle, since I need to dispatch for the largest mesh.

My idea is to initially use this extension for something simple to test it out: implementing the missing functions:

vkCmdDispatchIndirect(
    VkCommandBuffer                 commandBuffer,
    VkBuffer                        buffer,
    VkDeviceSize                    offset,
    uint32_t                        dispatchCount,
    uint32_t                        stride);

and

vkCmdDispatchIndirectCount(
    VkCommandBuffer                   commandBuffer,
    VkBuffer                          buffer,
    VkDeviceSize                      offset,
    VkBuffer                          countBuffer,
    VkDeviceSize                      countBufferOffset,
    uint32_t                          maxDispatchCount,
    uint32_t                          stride);

I do not know why they don’t exist, though. If the resulting numbers differ from the stress-test results, I’ll post them here. Only the path without explicit preprocess is possible for me, as the instances can change every frame.
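
Roughly, the emulation would be a two-token EXT layout: one push-constant token carrying the dispatch index, followed by one dispatch token. A minimal sketch (struct and enum names per my reading of the EXT spec; SequenceData, pipelineLayout, etc. are placeholder names to verify against the headers):

#include <cstddef>          // offsetof
#include <vulkan/vulkan.h>  // requires VK_EXT_device_generated_commands

// One sequence = one generated "sub-dispatch".
struct SequenceData {
    uint32_t                  dispatchIndex; // consumed by the push-constant token
    VkDispatchIndirectCommand dispatch;      // consumed by the dispatch token
};

VkIndirectCommandsPushConstantTokenEXT pcToken{};
pcToken.updateRange.stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
pcToken.updateRange.offset     = 0u;
pcToken.updateRange.size       = sizeof(uint32_t);   // just the dispatch index

VkIndirectCommandsLayoutTokenEXT tokens[2]{};
tokens[0].sType              = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT;
tokens[0].type               = VK_INDIRECT_COMMANDS_TOKEN_TYPE_PUSH_CONSTANT_EXT;
tokens[0].data.pPushConstant = &pcToken;
tokens[0].offset             = offsetof(SequenceData, dispatchIndex);
tokens[1].sType              = VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_TOKEN_EXT;
tokens[1].type               = VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT;
tokens[1].offset             = offsetof(SequenceData, dispatch);

VkIndirectCommandsLayoutCreateInfoEXT layoutInfo{
    VK_STRUCTURE_TYPE_INDIRECT_COMMANDS_LAYOUT_CREATE_INFO_EXT};
layoutInfo.shaderStages   = VK_SHADER_STAGE_COMPUTE_BIT;
layoutInfo.indirectStride = sizeof(SequenceData);
layoutInfo.pipelineLayout = pipelineLayout; // layout containing the matching push-constant range
layoutInfo.tokenCount     = 2u;
layoutInfo.pTokens        = tokens;

VkIndirectCommandsLayoutEXT dgcLayout = VK_NULL_HANDLE;
vkCreateIndirectCommandsLayoutEXT(device, &layoutInfo, nullptr, &dgcLayout);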

Okay, be aware that under the hood the implementations for DGC compute and DGC graphics are completely different, so none of the benchmarking results of the sample (which does graphics work) may transfer.

Well, I hope so, given this simple test case:

  • update a uint with the current index (for a part of a push_constant) + create an indirect dispatch command and put it in a buffer
  • this was added to a preparation compute shader that was also needed without DGC, so it is not an entire extra pipeline dispatch or anything
  • the extra code added was just this:
#ifdef DEVICE_GENERATED_COMMANDS
    DispatchIndirectCommand dispatch;
    dispatch.x = (sub_mesh_lod.vertex_count + LOCAL_SIZE - 1u) / LOCAL_SIZE;
    dispatch.y = 1u;
    dispatch.z = 1u;

    dispatch_indirect_buffer.instances[absolute_index].command = dispatch;
    dispatch_indirect_buffer.instances[absolute_index].dispatch_index = absolute_index;
#endif
  • the difference in the skinning compute shader was only this, as the absolute_index was already computed in the previous shader and put in the push_constant:
#ifdef DEVICE_GENERATED_COMMANDS
    const uint absolute_index = instance_index;
#else
    const uint absolute_index = instance_offset + gl_GlobalInvocationID.y;
#endif
  • execute the command layout (without an execution set, since the pipeline is not changed); see the sketch after this list
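
The execute itself then looks roughly like this (again a sketch: field names per my reading of the EXT spec; dgcLayout/SequenceData are the placeholders from the earlier sketch, the buffer addresses/sizes and skinningPipeline are placeholders too, and since there is no execution-set token it relies on the currently bound compute pipeline, so the relevant VUIDs should be double-checked):

VkGeneratedCommandsInfoEXT info{VK_STRUCTURE_TYPE_GENERATED_COMMANDS_INFO_EXT};
info.shaderStages           = VK_SHADER_STAGE_COMPUTE_BIT;      // must not be 0
info.indirectExecutionSet   = VK_NULL_HANDLE;                   // the pipeline is never switched
info.indirectCommandsLayout = dgcLayout;
info.indirectAddress        = sequenceBufferAddress;            // device address of the SequenceData array
info.indirectAddressSize    = sequenceCount * sizeof(SequenceData);
info.preprocessAddress      = preprocessAddress;                // sized via vkGetGeneratedCommandsMemoryRequirementsEXT
info.preprocessSize         = preprocessSize;
info.maxSequenceCount       = sequenceCount;
info.sequenceCountAddress   = 0;                                // fixed sequence count, no count buffer
info.maxDrawCount           = 0;                                // only relevant for draw tokens

vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, skinningPipeline);
vkCmdExecuteGeneratedCommandsEXT(cmd, VK_FALSE /* isPreprocessed */, &info);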

This took around 14-15 ms in total for a couple thousand instances being skinned. Whereas the wasteful method:

  • get the largest vertex count
  • execute 1 dispatch with enough workgroups for the largest vertex count, like so: vkCmdDispatch(cmd, (max_vertex_count + LOCAL_SIZE - 1) / LOCAL_SIZE, instance_count, 1u).
  • the compute shader uses gl_GlobalInvocationID.y to compute the instance index.

takes around 0.59ms.

I again hope this is due to the early implementation or some driver bug, or at the very least not representative of the graphics part of device generated commands, as those implementations are completely different, as you have said. Unless I’m doing something completely wrong, I unfortunately see no way this will improve my use case, as it is right now ~25.5x slower. I hope that the graphics part of DGC makes GPU-driven rendering with different shaders faster, although I cannot test that, as it would take a lot more effort to implement right now.

Note:

  • This was done with triangle buffer sizes such that at most 150 instances could be skinned at once, which meant that both the wasteful method and the DGC method took 17 dispatches/executions per frame.
  • All instances used the same model, as I did not see the need to put in time and effort to use different models to see if DGC had an advantage due to fewer wasted workgroups, as it was already a massive ~25.5x slower.
  • I increased the triangle buffer size to be 128 times larger; although more wasteful with memory, this allowed 2500 instances to be skinned at once. It resulted in the wasteful method being only very slightly faster: 0.56 ms instead of 0.59 ms. However, the DGC method went from 14-15 ms to ~170 ms per frame.

If I understand the above comment correctly, it is "execute 1 indirect dispatch with enough workgroups for the largest vertex count" vs "total of a couple thousand instances being skinned."
The latter is using DGC for compute and is thousands of dispatches with push constant updates in between? If so, the two are not identical, as the single big dispatch can lead to better occupancy on the GPU.
What happens when you change the former to behave similarly to what DGC is doing, with thousands of vkCmdDispatch calls and vkCmdPushConstants in between?
Do you get similar performance?

Or were you referring to the fact that the NV version of compute DGC is faster than the EXT version? I double-checked that with a VKCTS test, and I think my comparison results were quite similar, unless I missed something.

Actually I’d say it is (kind of) an equivalent situation as that is the only way VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT is not completely useless. There is no gl_DispatchID so it is not possible to distinguish between different dispatches unless something like a push_constant update is done. There is also no

typedef struct VkDispatchBaseIndirectCommand {
    uint32_t    baseX;
    uint32_t    baseY;
    uint32_t    baseZ;
    uint32_t    x;
    uint32_t    y;
    uint32_t    z;
} VkDispatchBaseIndirectCommand;

which would allow distinguishing between different dispatches. And even if it existed (although that would probably make it quite a bit more efficient for DGC), it would not change VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT being useless without push_constant updates, as that would obviously require a different token type, like VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_BASE_EXT for example.

Not being able to distinguish between calls would also make VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT useless in the sense that, if you cannot distinguish anyway but can still use it in your situation, a single vkCmdDispatchIndirect would do the trick. That would be a lot simpler, as no DGC is used; it would take a lot less VRAM, as no preprocess buffer is needed; and it would be a lot more efficient, as it is just a simple vkCmdDispatchIndirect call instead of DGC.
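
For clarity, the trivial alternative this refers to is just (buffer name illustrative):

// One aggregated indirect dispatch: a compute pass (or the CPU) writes a single
// VkDispatchIndirectCommand covering all the work, instead of one per instance.
vkCmdDispatchIndirect(cmd, dispatchArgsBuffer, 0);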

There is only 1 situation where VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT is not useless without push_constant updates, and that is if every single call needs a different compute pipeline, which seems unlikely to be required, unlike with fragment shaders. And even if all calls did use a different compute pipeline, the chance that (at least with the current performance) using DGC is faster than just some vkCmdDispatch or vkCmdDispatchIndirect calls is not very big.

Unless there is a way to very efficiently distinguish between dispatch calls that I am missing, I see this as an equivalent test case, as VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT would otherwise be (almost completely) useless.

What happens when you change the former to behave similarly to what DGC is doing, with thousands of vkCmdDispatch calls and vkCmdPushConstants in between?
Do you get similar performance?

No, I do not. When I use a similar behavior, I now literally have over 2500 separate dispatches per frame, and I went from 0.59 ms (with ~17 dispatches per frame) to 0.63 ms (with >2500 dispatches per frame). DGC is just really slow.

This means I went from:

auto* data = ...;
data->instance_offset = ...;

vkCmdPushConstants(...);
vkCmdDispatch(cmd, (max_vertex_count + LOCAL_SIZE - 1) / LOCAL_SIZE, instance_count, 1u);

to

auto* data = ...;
data->instance_offset = ...;

for (uint32_t i = 0u; i < instance_count; i++) {
    vkCmdPushConstants(...);
    vkCmdDispatch(cmd, (max_vertex_count + LOCAL_SIZE - 1) / LOCAL_SIZE, 1u, 1u);
    data->instance_offset++;
}

This means that either Nvidia’s compute part of the driver is impressively efficient, or the compute DGC implementation, just like the graphics one, is also very slow. However, having over 2500 dispatches (all of them with a different push_constant) be far faster than DGC seems like a bug to me. With the dispatches, the driver does not know what I may do, as these command buffers are re-recorded every frame. With DGC, however, the driver should already know what to expect (it gets a layout telling it to update a push_constant, for which it even already knows the size and the offset, and then do an indirect dispatch).

This is unlike the graphics test, which was at least close to the CPU alternative, although only when sorted. While the non-sorted variant (EXT only, as shown in the first posts) was a lot slower, with sorted state both the EXT and NV variants were around 2.5x slower than the CPU path. That is a far smaller difference than with the compute performance.

Or were you referring to the fact that the NV version of compute DGC is faster than the EXT version? I double-checked that with a VKCTS test, and I think my comparison results were quite similar, unless I missed something.

I was not; the sample does not contain a compute test, and I have not tested the NV version myself, as the README recommends using the EXT version going forward.

You can distinguish between dispatches in DGC with push constants. In fact, you can also emulate VkDispatchBaseIndirectCommand functionality with push constants.
Yes, there are some scenarios where vkCmdDispatchIndirect would do better, but DGC is more flexible, allowing you to switch pipelines and have different data per dispatch. They are not an equivalent comparison in all scenarios.

Performance-wise, we should expect a DGC command with thousands of dispatches to be equivalent to a command buffer with a similar number of vkCmdDispatches. I tested it locally in the VKCTS fork below, and the numbers are very close, e.g. 561 ms for the vkCmdDispatch loop and 558 ms for the DGC execute. See: Compare vkCmdDispatch perf with DGC Compute · vkushwaha-nv/VK-GL-CTS@051fcc6 · GitHub

You can distinguish between dispatches in DGC with push constants. In fact, you can also emulate VkDispatchBaseIndirectCommand functionality with push constants.

Yes, that is why I said:

Actually I’d say it is (kind of) an equivalent situation as that is the only way VK_INDIRECT_COMMANDS_TOKEN_TYPE_DISPATCH_EXT is not completely useless. There is no gl_DispatchID so it is not possible to distinguish between different dispatches unless something like a push_constant

in my post.

Yes, there are some scenarios where vkCmdDispatchIndirect would do better, but DGC is more flexible, allowing you to switch pipelines and have different data per dispatch. They are not an equivalent comparison in all scenarios.

I know, so I’d expect it to be a bit slower, but not 270x slower than just brute-forcing over 2500 dispatches per frame, as I put in my post:

No, I do not. When I use a similar behavior, I now literally have over 2500 separate dispatches per frame, and I went from 0.59 ms (with ~17 dispatches per frame) to 0.63 ms (with >2500 dispatches per frame). DGC is just really slow.

Performance-wise, we should expect a DGC command with thousands of dispatches to be equivalent to a command buffer with a similar number of vkCmdDispatches. I tested it locally in the VKCTS fork below, and the numbers are very close, e.g. 561 ms for the vkCmdDispatch loop and 558 ms for the DGC execute.

That’s the problem: that did not happen, as already mentioned.

Did you get a chance to review the example code I shared earlier? It might help determine whether the issue is on the app side or if we’re encountering a corner case triggered by the app in our driver. If you’d like, you can share your sample code that isolates this, and I can take a closer look at where the slowness might be coming from.

I’m now building it, so not yet.

I just ran it. However, as it is not realtime, I cannot really see whether it is slower than it should be or not, since even if it takes >100 ms, as a test it will still seem very fast.

If you’d like, you can share your sample code that isolates this, and I can take a closer look at where the slowness might be coming from.

As the code comes from an engine with everything abstracted, it is not as easy as just copying something, although I can give some shader snippets.

I tried to capture it with NVIDIA Nsight Graphics, but it seems it does not support the extension, so it just says that the vkCmdPipelineBarrier2 takes >1 ms per call, whereas in the non-DGC capture it just says <0.01 ms.

Would sending both of these captures (still) be useful, and if so how?

The first shader:

layout(local_size_x = LOCAL_SIZE, local_size_y = 1, local_size_z = 1) in;

layout(push_constant) uniform Constants {
	SkinnedDrawCallBuffer draw_call_buffer;
#ifdef GPU_SKINNING
	FullSkinnedSkeletalMeshBuffer instance_buffer;
#else
	SkeletalMeshBuffer instance_buffer;
#endif
#ifdef DEVICE_GENERATED_COMMANDS
	IndirectDispatchContextInstanceBuffer dispatch_indirect_buffer;
#else
	buffer_ptr padding;
#endif
	uint instance_count;
	uint instance_offset;
};

void main() {
	const uint absolute_index = instance_offset + gl_GlobalInvocationID.x;

	if (absolute_index < instance_count) {
		const uint lod_index = instance_buffer.instances[absolute_index].lod_index;
		const uint sub_mesh_index = instance_buffer.instances[absolute_index].sub_mesh_index;

		MeshInfoHeaderBuffer header_buffer = MeshInfoHeaderBuffer(instance_buffer.instances[absolute_index].mesh_info_buffer);
		MeshSubMeshLod sub_mesh_lod = header_buffer.sub_mesh_buffer.sub_meshes[sub_mesh_index].lod_buffer.lods[lod_index];

		DrawIndirectCommand call = {};
		call.instance_count = 1u;
		call.first_instance = absolute_index;
		call.vertex_count = sub_mesh_lod.index_count;

		const uint material_index_and_instance_flags = PackMaterialIndexAndFlags(instance_buffer.instances[absolute_index].material_index, 
																				 instance_buffer.instances[absolute_index].instance_flags);
		const uint sub_mesh_index_and_lod_index = PackSubMeshIndexAndLod(sub_mesh_index, lod_index);

		draw_call_buffer.draw_calls[absolute_index].draw_call = call;
		draw_call_buffer.draw_calls[absolute_index].material_index_and_instance_flags = material_index_and_instance_flags;
		draw_call_buffer.draw_calls[absolute_index].sub_mesh_index_and_lod_index = sub_mesh_index_and_lod_index;
		// Note: bug on AMD: if `index_buffer` is retrieved from the tmp variable `sub_mesh_lod`, it will only write half the bytes needed
		// draw_call_buffer.draw_calls[absolute_index].index_buffer = buffer_ptr(sub_mesh_lod.index_buffer);
		draw_call_buffer.draw_calls[absolute_index].index_buffer = buffer_ptr(header_buffer.sub_mesh_buffer.sub_meshes[sub_mesh_index].lod_buffer.lods[lod_index].index_buffer);
		draw_call_buffer.draw_calls[absolute_index].mesh_info_buffer = buffer_ptr(header_buffer);
		draw_call_buffer.draw_calls[absolute_index].first_vertex = instance_buffer.instances[absolute_index].base_vertex_target;
		draw_call_buffer.draw_calls[absolute_index].outline_color = instance_buffer.instances[absolute_index].outline_color;

#ifdef DEVICE_GENERATED_COMMANDS
		DispatchIndirectCommand dispatch;
		dispatch.x = (sub_mesh_lod.vertex_count + LOCAL_SIZE - 1u) / LOCAL_SIZE;
		dispatch.y = 1u;
		dispatch.z = 1u;

		dispatch_indirect_buffer.instances[absolute_index].command = dispatch;
		dispatch_indirect_buffer.instances[absolute_index].dispatch_index = absolute_index;
#endif
	}
}

This is the shader that sets up the dispatch calls, after which a pipeline barrier is done, followed by the dispatch or the DGC execute. The skinning shader itself just loads some data from buffer device addresses, does some matrix multiplications, and then writes a vertex to a buffer device address; aside from the one line shown in an earlier post (how the index is computed), it is the same for both non-DGC and DGC.
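
For completeness, the barrier between the setup dispatch and the consumer is roughly the following (stage/access masks are my assumption based on the description above; with explicit preprocessing the EXT command-preprocess stage/access bits would also be needed):

VkMemoryBarrier2 barrier{VK_STRUCTURE_TYPE_MEMORY_BARRIER_2};
barrier.srcStageMask  = VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;   // setup shader writes the records
barrier.srcAccessMask = VK_ACCESS_2_SHADER_STORAGE_WRITE_BIT;
barrier.dstStageMask  = VK_PIPELINE_STAGE_2_DRAW_INDIRECT_BIT |   // indirect command fetch
                        VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;   // skinning reads the same buffers
barrier.dstAccessMask = VK_ACCESS_2_INDIRECT_COMMAND_READ_BIT |
                        VK_ACCESS_2_SHADER_READ_BIT;

VkDependencyInfo dep{VK_STRUCTURE_TYPE_DEPENDENCY_INFO};
dep.memoryBarrierCount = 1u;
dep.pMemoryBarriers    = &barrier;
vkCmdPipelineBarrier2(cmd, &dep);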

I forgot to mention earlier, but for validation, you can tweak these parameters in the test to switch between DGCC and vkCmdDispatches, increase/decrease the GPU workload, etc.:

constexpr uint32_t kLocalInvocations = 64u;
constexpr uint32_t kWorkgroupSize = 2048u;

constexpr uint32_t kSequenceCount = 3*1024u;

constexpr bool doDGCC = true;
constexpr bool doExplicitPreprocess = false;

It will also dump GPU execution time of the commands using time stamps.

In the tests I did with VK-GL-CTS, it returned around the same speed every time, which is weird.

I tried removing the query pool I use, which did not help. I also tried removing the skinning: the skinning dispatches would still run, but they would not load/write/compute anything, just an empty shader. This got the DGC version from 15 ms to 3 ms, while the vkCmdDispatch version went from 0.60 ms to 0.13 ms, which may be interesting. The non-DGC version thus had a speedup of about 4.6x and the DGC version of about 5x; the ratios are similar, but the absolute difference is still very large.

I don’t think that is weird. Internally, in our implementation, the two methods should be generating similar HW instructions, and I don’t expect more than 2-3% variance across runs.

I did not mean that the performance being similar is weird, but rather that it is similar in that test while in my case there is such a huge difference, even though it runs on the same GPU and driver.