Optimal vertex and index layout order on modern GPUs?

Hi there.

I’m coding for modern GPUs (OpenGL 4.x, etc.).

The mesh data I’m sending to the graphics card is pretty much direct output from Maya, so I’m guessing it has poor vertex-cache ordering.
I’m hoping to use a vertex-cache-optimization pre-pass on the meshes to gain some performance, as suggested here:
http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html

The meshes have 500,000+ triangles in them (rendered as triangles via an index buffer).
We also do a number of shadow passes, which puts further demands on vertex throughput.

The thing is…
I have tried TomF’s algorithm (as described in the link) and it actually made things slower! :( And I definitely performed all the steps,
i.e.

  • index buffer re-ordering
  • rebuild vertex buffers using the new index ordering to achieve near-linear access
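
For anyone following along, the second step can be sketched roughly like this. It is only a minimal sketch, assuming 32-bit indices and a made-up `Vertex` struct (both are illustrative assumptions, not the layout used in this project): vertices are moved into the order in which the re-ordered index buffer first touches them, and the indices are rewritten to match.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical vertex layout, for illustration only.
struct Vertex {
    float position[3];
    float normal[3];
};

// Reorders vertices by first use in the (already re-ordered) index buffer
// and rewrites the indices in place to point at the new locations.
// Vertices that no index references are dropped from the result.
std::vector<Vertex> remapVerticesByFirstUse(const std::vector<Vertex>& vertices,
                                            std::vector<std::uint32_t>& indices)
{
    const std::uint32_t kUnassigned = 0xFFFFFFFFu;
    std::vector<std::uint32_t> oldToNew(vertices.size(), kUnassigned);
    std::vector<Vertex> reordered;
    reordered.reserve(vertices.size());

    for (std::uint32_t& index : indices) {
        assert(index < vertices.size());
        if (oldToNew[index] == kUnassigned) {
            // First time this vertex is referenced: append it to the new buffer.
            oldToNew[index] = static_cast<std::uint32_t>(reordered.size());
            reordered.push_back(vertices[index]);
        }
        index = oldToNew[index];
    }
    return reordered;
}
```

After this pass the index buffer reads the vertex buffer in a mostly increasing order, which is what "near-linear access" refers to above.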

Any idea why this would be? I’m guessing the assumptions that Tom made back in 2006 do not hold for modern GPUs?

NOTE:
I used both of these implementations

And both produced the same slowdown (from 31fps to 29fps).
So I’m guessing both are consistent (and therefore hopefully correct) implementations.

My Question:
How should I approach this problem given modern-day architectures?
How should I be ordering the data to achieve the best performance on the card?
Is it worth doing anything at all?

Thanks a lot! :)
Brian

First of all, I can’t see how the optimized mesh’s vertices would be processed any slower if the triangle and vertex counts are still the same, unless the original mesh just happened to be extremely well optimized and the algorithm you used did not work that well in your particular case. (You could try randomly rearranging the mesh and see if it gets slower?)
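
One way to check whether the original mesh was already well ordered, without touching the GPU, is to estimate the average cache miss ratio (ACMR) of each index buffer. The sketch below is only a rough model: it assumes the post-transform cache behaves like a plain FIFO, and the cache size is a parameter you have to guess. Real hardware may behave quite differently, so treat the numbers as a relative comparison between orderings, not as a prediction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Simulated vertex shader invocations per triangle for a triangle list.
// Assumption: the cache is a simple FIFO holding `cacheSize` vertex indices.
// 3.0 means every index missed; lower values mean more vertex reuse.
double computeAcmr(const std::vector<std::uint32_t>& indices, std::size_t cacheSize)
{
    assert(indices.size() % 3 == 0);
    assert(cacheSize > 0);
    if (indices.empty()) return 0.0;

    std::deque<std::uint32_t> fifo;
    std::size_t misses = 0;
    for (std::uint32_t index : indices) {
        if (std::find(fifo.begin(), fifo.end(), index) != fifo.end())
            continue;  // Hit: a FIFO does not reorder its entries on a hit.
        ++misses;
        fifo.push_back(index);
        if (fifo.size() > cacheSize) fifo.pop_front();  // Evict the oldest entry.
    }
    return static_cast<double>(misses) / static_cast<double>(indices.size() / 3);
}
```

Running it on the Maya ordering and on the "optimized" ordering with a few guessed cache sizes (for example 16 and 32) should show whether the optimizer actually improved reuse under this model.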

Sorry if I’m not of much use. I don’t have the time to try it out myself right now, but I have some questions regarding your results.

Have you tried other potential changes, like splitting the mesh into 10 sub-meshes (and being okay with 10 draw calls instead of one) and optimizing each of these meshes? Does that impact the frame rate? How many meshes do you draw each frame? If you get such a low FPS, I guess you are drawing a lot of them? (Including the shadows and everything, how many triangles are you drawing each frame?)

How are you measuring the frame rate? Is it consistently 2 FPS slower, and are you sure there is no external interference? Are you using Vsync when measuring?

What fragment shader are you running on them, and have you tested whether you get more overdraw with the “optimized” mesh compared to the previous mesh?
If you have an expensive fragment shader (high quality shadows or something similar) and you get a lot more overdraw, that could explain the slowdown. The vertex shader might run faster, but maybe you run more fragments? Maybe you now end up drawing a lot of small, far-away triangles to the buffer first, producing a lot of fragments, and then overdraw them with something else, where before those fragments were culled by the early Z test. Try modifying the fragment shader to prevent any early Z tests and see how that impacts the FPS with the different meshes. Also try replacing the fragment shader with a simple one that outputs a single color or a checker texture. That may shrink the FPS difference, since overdraw matters less with a cheaper fragment shader.

Have you tried it on different “modern” GPUs/different computers?

I don’t know how much any of that really helps, but maybe we can at least get closer to what is causing the slowdown. There can be other things than just the vertex cache/processing.

Good luck, and I hope to hear from you if you get any further with this! :)

Thanks for the reply Gafgar

Our renderer is not a game renderer. It renders a ton of shadowmaps on a per-frame basis to achieve high quality shadows (hence the low fps). So I’m fairly confident we’re vertex-shader bound rather than fragment-shader bound.
But nonetheless, I need to try everything because this result does not seem right.

You just raised a very interesting point regarding overdraw.
All my measurements were with a single view direction, with a single light pointed from a side angle.

Maybe the index buffer re-ordering has changed the mesh so that it now renders in a more back-to-front order, where previously it was (maybe) rendering in a more front-to-back order. This might explain the slowdown we’re seeing.

What I’ll try on Monday is this…

  • Simple OpenGL drawing using a flat-shaded shader ( no shadows or anything fancy ). And simply render the model 25+ times, to really thrash the card.
  • Render with depth test off, depth write off, and triangle culling off.

This should stop any fragment shader and overdraw issues from affecting the result.

I’ll let you know how I go.
Thanks
Brian

Hi

I’ve tried the new changes.
Success! : )
So this is rendering with depthtest=false, depthwrite=false, cullface=false.
And with a basic fragment shader, all shadows off. And rendering the mesh 25x.

From Maya = 17fps
After optimization = 17.5fps
So it’s good to see a speedup, albeit only 3%.

Out of curiosity I decided to randomly shuffle the index buffer ( as opposed to optimizing it ) to see how bad things got.
After random shuffle = 6fps
interesting…
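
In case anyone wants to repeat the shuffle test, something along these lines should do it. This is just a sketch of one way to do it, assuming a triangle list with 32-bit indices: it shuffles whole triangles (groups of three indices) so that the geometry stays identical and only the draw order changes. The fixed seed is an arbitrary choice to keep runs repeatable.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Randomly permutes the triangles of an indexed triangle list in place.
// The three indices of each triangle stay together and keep their winding.
void shuffleTriangles(std::vector<std::uint32_t>& indices, std::uint32_t seed)
{
    assert(indices.size() % 3 == 0);
    using Triangle = std::array<std::uint32_t, 3>;

    // Gather the flat index list into triangles.
    std::vector<Triangle> triangles(indices.size() / 3);
    for (std::size_t t = 0; t < triangles.size(); ++t)
        triangles[t] = Triangle{{indices[3 * t], indices[3 * t + 1], indices[3 * t + 2]}};

    std::mt19937 rng(seed);
    std::shuffle(triangles.begin(), triangles.end(), rng);

    // Write the shuffled triangles back into the flat index list.
    for (std::size_t t = 0; t < triangles.size(); ++t) {
        indices[3 * t]     = triangles[t][0];
        indices[3 * t + 1] = triangles[t][1];
        indices[3 * t + 2] = triangles[t][2];
    }
}
```

Shuffling individual indices instead of whole triangles would scramble the mesh itself, so the frame rates would no longer be comparable.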

I found this library to execute the fastest. ( both produced the same fps once done )

Thanks for the help!
Brian

Nice to hear that you found the source of the problem! : )

But I think you should also consider splitting the mesh up and drawing the closest parts first, or at least first drawing some decimated version of the mesh in a Z-only pass, to speed things up a bit. That is, if you are still planning to run an expensive fragment shader? : ) At least I think you should try it out. Also, make sure that your fragment shader doesn’t do anything that might prevent early Z culling.

Decimating the mesh and running it through an early Z pass should be a quick and easy test. Making sure the decimated mesh always stays inside the original would be a little harder, but should not be impossible. Still, a quick test to see what impact it has on the FPS should be easy, even if its unpolished state produces some small errors. (I wonder if there would be an easy way to make sure the low-poly version stays behind the high-poly mesh by offsetting the vertices away from the camera by some small amount in the vertex shader (probably after the projection)… it should probably still stop a lot of instances of overdraw.)

You did mention rendering a lot of shadows. I know a lot of projects render a LOD version of the meshes for the shadows. But I don’t know what kind of shadows you are rendering, or in what kind of project, so maybe it is not possible in your case. (I usually use the Assassin’s Creed games as a reference when talking about LOD in shadows… it is easy to see if you are looking for it. Windows and similar small-profile objects on houses do not cast any shadows at a distance (i.e. they are not drawn in the more distant cascades).) The trick is, as always, to cheat where nobody notices ;)

Please post some results if you manage to optimize it further! :3 It would be interesting to see! (Or if you have a blog or similar where you are going to post results.)

Anyway, good luck with your project! ^^