No matter what I do, I can’t get uniform buffer objects to perform better than regular glUniform* calls. Is UBO performance a known issue, or could you provide a “best practice” for using them?
…Could you give a little more detail about your use case?
- update frequency
- how many shaders are sharing the UBO
- UBO size
- etc.
You might want to look at this presentation from the last NVIDIA GTC for best practices and performance comparisons of rendering methods and parameter updates:
http://on-demand.gputechconf.com/gtc/2014/presentations/S4379-opengl-44-scene-rendering-techniques.pdf
Related topic from the year before:
http://on-demand.gputechconf.com/gtc/2013/presentations/S3032-Advanced-Scenegraph-Rendering-Pipeline.pdf
There should also be recordings of these presentations at
http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php
I’m doing what the presentations suggest, but performance is still worse than with regular glUniform* calls. Other people report the same problem (search for “ubo performance”). I split the uniforms into two uniform blocks: one contains frequently updated uniforms (matrices and the like), and the other contains static uniforms that are updated just once per frame. I upload the data with glNamedBufferSubDataEXT. I’ve also tried several other variants.
The application is a military flight simulator, and we’re currently CPU bound. We’re using direct state access, bindless, streaming, and interleaved VBOs. My test scene has about 40 unique materials (we use array textures to reduce the material count). This is on Linux, so I’m running driver 331.79 on a 780.
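For readers following along, the two-block split described above could be mirrored on the CPU side roughly as below. This is only a sketch under assumptions: it uses the std140 layout rather than the default shared layout, because std140 offsets are fixed by the specification, whereas shared-layout offsets would have to be queried (for example with glGetActiveUniformsiv and GL_UNIFORM_OFFSET). All block and member names are hypothetical and not taken from the poster's application.

```cpp
#include <cstddef>

// Assumed GLSL declarations (hypothetical names, std140 layout):
//
//   layout(std140) uniform PerFrame {
//       mat4  viewProjection;
//       vec4  cameraPosition;
//       vec4  sunDirection;
//       float timeSeconds;
//   };
//   layout(std140) uniform PerObject {
//       mat4 modelMatrix;
//       vec4 tintColor;
//   };

// Static data, uploaded at most once per frame.
struct alignas(16) PerFrameUniforms {
    float viewProjection[16];  // mat4,  std140 offset 0
    float cameraPosition[4];   // vec4,  std140 offset 64
    float sunDirection[4];     // vec4,  std140 offset 80
    float timeSeconds;         // float, std140 offset 96
    float padding[3];          // pads the struct to a multiple of 16 bytes
};

// Frequently updated data, uploaded per object.
struct alignas(16) PerObjectUniforms {
    float modelMatrix[16];     // mat4, std140 offset 0
    float tintColor[4];        // vec4, std140 offset 64
};

// Compile-time checks that the C++ mirrors have the sizes assumed above.
static_assert(sizeof(PerFrameUniforms) == 112, "unexpected PerFrame size");
static_assert(sizeof(PerObjectUniforms) == 80, "unexpected PerObject size");
```

Each struct can then be uploaded in a single call, so the per-frame block is touched once per frame and only the small per-object block changes between draws.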
Note that the performance comparisons in the first linked presentation were made with different, newer(!) drivers than the one you’re using. Please check again with the upcoming driver generation; for example, beta versions of 340.xx are already available.
If that doesn’t improve parameter update performance, the bottleneck would need some more analysis.
I installed the 340.17 driver. Here are my timings, using the default shared layout, on a 780 and a Core i7 960. I’ll have to try this with a more modern CPU at some point.
9.5 ms all uniforms in a uniform block
8.75 ms static uniforms (updated at most once per frame) in uniform block
8.25 ms no uniform blocks
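To put the three timings above in relative terms, here is a small sketch that computes the slowdown against the "no uniform blocks" baseline. The function name is hypothetical; the input values are the ones quoted in the post.

```cpp
// Relative slowdown of `measuredMs` versus `baselineMs`, in percent.
constexpr double overheadPercent(double measuredMs, double baselineMs) {
    return (measuredMs - baselineMs) / baselineMs * 100.0;
}

// All uniforms in a uniform block: (9.5 - 8.25) / 8.25, about 15.2% slower.
constexpr double allInBlock = overheadPercent(9.5, 8.25);

// Only static uniforms in a uniform block: (8.75 - 8.25) / 8.25, about 6.1% slower.
constexpr double staticInBlock = overheadPercent(8.75, 8.25);
```

So in this test, moving everything into uniform blocks costs roughly 15% and moving only the static uniforms costs roughly 6%, relative to plain glUniform* calls.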
You can also try to implement your uniform buffers using coherent, persistent mapping. I had the very same issue: glUniform was way faster than any buffer update call. When I switched to persistently mapped, coherent buffers and handled the synchronization myself, I got a big performance boost.
It should also be noted that, to make uniform buffers as efficient as possible, you should consider multi-buffering them, so that you always write to a section of the buffer that the GPU isn’t currently using and thereby avoid stalling.
However, I should note that not everything in my project is using uniform buffers, only the stuff which needs either extremely frequent updating (guaranteed per object stuff like transform matrices) or not so frequent updating (per frame stuff like view matrices and such).
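The multi-buffering idea above can be sketched as CPU-side bookkeeping for one large buffer divided into regions, one region per frame in flight. This is not the poster's actual code, just a minimal illustration under assumptions: the class name `UniformRing` and its methods are hypothetical, the alignment value would come from querying GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, and the real GL calls (glBufferStorage with the persistent and coherent map bits, glMapBufferRange, glFenceSync, glClientWaitSync, glBindBufferRange) are only indicated in comments so the sketch compiles without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to a multiple of `alignment` (must be a power of two).
inline std::size_t alignUp(std::size_t n, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

// Hypothetical bookkeeping for a persistently mapped UBO split into regions.
class UniformRing {
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    UniformRing(std::size_t regionCount, std::size_t regionSize,
                std::size_t offsetAlignment)
        : regionSize_(alignUp(regionSize, offsetAlignment)),
          alignment_(offsetAlignment),
          fencePending_(regionCount, false),
          current_(regionCount - 1),  // so the first beginFrame() selects region 0
          cursor_(0) {
        assert(regionCount > 0);
    }

    // Size to pass to glBufferStorage / glMapBufferRange in real code.
    std::size_t totalSize() const { return regionSize_ * fencePending_.size(); }

    // Advance to the next region. Returns true if that region was used by an
    // earlier frame; real code would then glClientWaitSync on the region's
    // fence before writing, which only blocks if the GPU is still reading it.
    bool beginFrame() {
        current_ = (current_ + 1) % fencePending_.size();
        cursor_ = 0;
        const bool mustWait = fencePending_[current_];
        fencePending_[current_] = false;
        return mustWait;
    }

    // Reserve `bytes` in the current region. Returns the byte offset into the
    // whole buffer (usable with glBindBufferRange), or npos if the region is full.
    std::size_t allocate(std::size_t bytes) {
        const std::size_t end = cursor_ + alignUp(bytes, alignment_);
        if (end > regionSize_) return npos;
        const std::size_t offset = current_ * regionSize_ + cursor_;
        cursor_ = end;
        return offset;
    }

    // Mark the region as submitted; real code would call glFenceSync here.
    void endFrame() { fencePending_[current_] = true; }

private:
    std::size_t regionSize_;
    std::size_t alignment_;
    std::vector<bool> fencePending_;
    std::size_t current_;
    std::size_t cursor_;
};
```

In use, each returned offset would be added to the persistently mapped pointer for the CPU write and passed to glBindBufferRange for the draw. With three regions, the CPU can typically run up to two frames ahead of the GPU before a fence wait is needed.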
We have now been able to open-source the work that led to the results in the GTC presentations above.
Please have a look at https://devtalk.nvidia.com/default/topic/777618/scenix/announcing-nvpro-pipeline-a-research-rendering-pipeline/
That should allow you to investigate the different options available to pass parameters to GLSL shader programs and possibly overcome your current UBO bottlenecks.
Please keep in mind that all results presented were measured on Quadro boards and also rely on improvements inside the OpenGL driver itself, so use the newest drivers available when benchmarking.