Small draws impact on occupancy

Hey.

I’m currently profiling a scene where the vertex shader is killing all of the performance. The problem is that occupancy during this pass is very low: a lot of slots on the SM are unused (~90%), even though the shader is barely utilizing the GPU (the unit throughputs are very low) and it’s not register-bound.

My theory is that this is caused by how insanely small our draw calls are in this scenario. We have a big fat ExecuteIndirect with almost 2.5k commands, each mostly having fewer than 10 instances of around 1-2 triangles (a triangle or a quad).

Before I start changing anything, I’d like to better understand what’s going on here. To my knowledge, each instance has to be a separate warp, so if I have a lot of instanced draw calls with 10 instances of 1-2 triangles, I quickly hit the limit of max warps per SM, and the warps themselves are not really full because of the small meshes.
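To put rough numbers on that guess, here is a minimal sketch of the arithmetic. It assumes 32 threads per warp and it bakes in my unconfirmed assumption that every instance starts its own warp. The function names are made up for illustration, and none of this describes how the hardware is known to batch vertices.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 32 threads per warp (the warp size on NVIDIA GPUs).
constexpr std::uint32_t kWarpSize = 32;

// Hypothetical model, NOT confirmed hardware behavior: every instance starts
// a fresh warp, so vertices from different instances never share a warp.
std::uint32_t warps_per_draw(std::uint32_t instances,
                             std::uint32_t verts_per_instance) {
    assert(verts_per_instance > 0);
    // Round up: a partially filled warp still occupies a whole warp slot.
    std::uint32_t warps_per_instance =
        (verts_per_instance + kWarpSize - 1) / kWarpSize;
    return instances * warps_per_instance;
}

// Fraction of warp lanes doing useful vertex work under the same model.
double lane_utilization(std::uint32_t instances,
                        std::uint32_t verts_per_instance) {
    std::uint32_t warps = warps_per_draw(instances, verts_per_instance);
    if (warps == 0) return 0.0;
    double useful_lanes = static_cast<double>(instances) * verts_per_instance;
    double total_lanes = static_cast<double>(warps) * kWarpSize;
    return useful_lanes / total_lanes;
}
```

For a draw with 10 instances of an indexed quad (4 unique vertices), this model gives 10 warps with 4 of 32 lanes active in each, i.e. 12.5% lane utilization. If the distributor actually packs vertices from several instances or draws into one warp, the numbers would look very different, which is exactly the part I’d like clarified.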

What I don’t actually know is how the work distributor creates warps and workgroups, so all of the above is just speculation. I’d really appreciate some clarification here before I start making changes to our geometry processing pipeline.

Just to make sure it’s well understood, here’s what the contents of the ExecuteIndirect look like:
[image]

and here’s the problematic part of the frame

Hello @woookie41, welcome to the NVIDIA developer forums!

Sorry it took a bit longer to reply to your post.

I must admit this is beyond my understanding of DX and our implementation of it, so I need to reach out to some experts. But since this is really specific and low-level, I cannot promise how quickly I will find someone with cycles to look into this.

But I’ll try.