I have an app that renders a large map whose elements have different rendering states. The current implementation is poor: it's hard to batch and optimize. So I'm refactoring it around the GL_NV_command_list extension, because this extension seems to enable batching across different blend states and mesh formats, meaning I wouldn't need to modify the mesh generation and sorting (correct me if I'm wrong). Because the target codebase is complex, I first built a sample project, using GLFW for window management. I dumped my map data and batched the drawing, which gave a 12x performance boost, from 60 ms to 5 ms (10,000 draw calls down to 1).
But I have some problems and questions:
1. I noticed that when batch-rendering to a framebuffer, it works fine even if I don't make the attached texture resident. Is this correct?
2. To minimize GPU memory usage, I attached a renderbuffer to the target framebuffer. A renderbuffer doesn't support bindless, but it works. Is this correct?
3. When I switch v-sync off using glfwSwapInterval, the frame render time is 5 ms, but with v-sync enabled it increases to 60 ms. Is there some bad interaction between this extension and v-sync? Is this a known issue?
4. When I ported this batching system from the prototype to my real project, I didn't get the same performance boost. In my project, batch rendering takes 10+ ms, and the approach even seems to introduce extra time, resulting in an overall 30 fps where the prototype runs at 200 fps. What could be wrong with my project code? FYI, my project uses Qt for window management.
I've also uploaded my prototype project to GitHub; any sort of reply would be helpful. Thanks.
After some tweaking, I can confirm that v-sync has something to do with this extension. I also found that when v-sync is enabled, I can improve performance by merging the mesh vertex data into one larger buffer and indexing the data by offset. This approach improves the v-sync-enabled scenario but degrades the v-sync-disabled one.
I don't understand why merging buffers results in worse performance with v-sync off. It should be at least the same, if not better.
Can anyone help me out?
| Performance | single buffer | merged buffer |
| --- | --- | --- |
| v-sync off | 5 ms | 10 ms |
| v-sync on | 30 ms | 16 ms |
My git repository is GitHub - liufangyuan247/NVCommandListSample
The branch one_bulk_buffer is an optimized version: 5 ms with v-sync off.
The branch buffer_manager is the target version: only 10 ms with v-sync off.
I don't know why the performance differs.
one_bulk_buffer extracts all mesh data and puts it into one buffer.
buffer_manager uses a buffer manager to allocate buffer space, but I set the block size to 128 MB, which effectively puts all mesh data into one buffer as well, yet the performance shows a big difference.
I had difficulty building the official sample project on my computer, but my project reuses some code from it.
I measure the fps using the ImGui demo window. That should be accurate enough to show the difference in performance.
Also, on my computer, using the devel branch: if I switch v-sync off in window.cpp it runs at 200 fps, and switching v-sync on drops it to 20 fps or so. I'll try to upload a video later.
I haven't yet found anyone who might have insights on your original question about usage of the extension.
But regarding glfwSwapInterval I might be able to shed some light.
What it does under the hood is tell GLFW, and implicitly OpenGL, to wait to swap buffers until the next v-blank signal arrives. Naturally, if your monitor runs at 60 Hz, the frame time until the swap will be 16.67 ms; if your monitor is slower, it will be longer. But in the case of windowed apps this also depends on the window manager's compositor, which takes time as well, and of course Qt adds its own overhead. If all of this adds up enough, you easily miss a v-blank, and the measured frame time doubles. The extension is completely unrelated to this internal process. SwapInterval simply means that any new commands are queued for later processing until the current back buffer has been swapped.
In your GIF above you can see it towards the end, when you start interacting with the window: suddenly there are frame times below 10 ms alongside frame times matching 60 Hz.
You could try running full-screen and logging the output to see what changes.
Another thing: you use implicit object destruction for your timer class. I am not an expert on C++ on Linux, but it might do lazy destruction of objects, which would also add delays.
I see. Never mind the performance issue then. One more thing: I noticed that when my VRAM usage is very high (99%), the output image flickers frequently. I think this may be related to my render target using a renderbuffer that isn't made resident?
If EVICTED_MEMORY ever increases, you're overrunning GPU memory and overflowing to CPU RAM. Forget good performance; this can cause stuttering, flickering, brief hangs, etc. Assuming that doesn't happen, GPU memory consumption can be estimated as TOTAL_AVAILABLE_MEMORY - CURRENT_AVAILABLE_MEMORY.
Thanks, that's exactly what I did: I sample CURRENT_AVAILABLE_VIDMEM at app startup and switch to the normal render pass if VRAM is insufficient. I didn't examine the EVICTED_MEMORY value, but our app also runs CUDA programs that use a lot of VRAM, and they pin the CUDA buffers (I'm not familiar with CUDA, but this looks similar to making a buffer resident?). Combined with the command-list usage, that leaves the GPU little memory to swap, which leads to the flicker.