NOT enabling debug output makes uploading meshes extremely slow

We are developing a game that often creates meshes at runtime. We’ve been seeing a curious performance degradation whenever the following lines were NOT run during initialization of our renderer:

			glEnable(_EXT_DEBUG_OUTPUT_SYNCHRONOUS_ARB); // Godot-local define for GL_DEBUG_OUTPUT_SYNCHRONOUS_ARB (0x8242)
			glDebugMessageCallbackARB(_gl_debug_print, nullptr);
			glEnable(_EXT_DEBUG_OUTPUT); // Godot-local define for GL_DEBUG_OUTPUT (0x92E0)
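
(For context, _gl_debug_print is just our logging callback with the standard ARB_debug_output signature; this is a sketch of its shape, not the exact engine code:)

	// Assumes GL headers and <cstdio> are included; APIENTRY comes from the GL headers.
	static void APIENTRY _gl_debug_print(GLenum source, GLenum type, GLuint id,
			GLenum severity, GLsizei length, const GLchar *message, const void *user_param) {
		// The real callback also logs source/type/severity; printing the message suffices here.
		fprintf(stderr, "GL debug: %.*s\n", (int)length, message);
	}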

We run these lines only when a debug flag is enabled in the engine, and in that case uploading a mesh takes about 100 microseconds. Without them, uploading a mesh always takes about 15 milliseconds regardless of size (tested from 1 to 512 triangles), which makes the game experience a lot worse.

The lines that take that long are the following:

		PoolVector<uint8_t>::Read vr = array.read();

		glGenBuffers(1, &surface->vertex_id);
		glBindBuffer(GL_ARRAY_BUFFER, surface->vertex_id);
		glBufferData(GL_ARRAY_BUFFER, array_size, vr.ptr(), (p_format & VS::ARRAY_FLAG_USE_DYNAMIC_UPDATE) ? GL_DYNAMIC_DRAW : GL_STATIC_DRAW);
		glBindBuffer(GL_ARRAY_BUFFER, 0); //unbind

		if (p_format & VS::ARRAY_FORMAT_INDEX) {
			PoolVector<uint8_t>::Read ir = p_index_array.read();

			glGenBuffers(1, &surface->index_id);
			glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, surface->index_id);
			glBufferData(GL_ELEMENT_ARRAY_BUFFER, index_array_size, ir.ptr(), GL_STATIC_DRAW);
			glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0); //unbind
		}

In particular, the first call to glGenBuffers is the slowest; the following calls are fast. But if I enable the debug output as above, all calls are fast.
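
(For reference, the timings above are plain CPU-side measurements around the upload, roughly like this sketch, where upload_mesh stands in for the buffer-creation code above and is not an actual engine function:)

	#include <chrono>
	#include <cstdio>

	void upload_mesh(); // hypothetical wrapper around the glGenBuffers/glBufferData sequence above

	void timed_upload() {
		using clock = std::chrono::steady_clock;
		const clock::time_point begin = clock::now();
		upload_mesh(); // queues the buffer creation + data upload on the GL context
		const double us = std::chrono::duration<double, std::micro>(clock::now() - begin).count();
		// Note: this only measures how long the CPU was blocked in the GL calls,
		// not when the driver/GPU actually performs the work.
		std::printf("upload took %.1f us\n", us);
	}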

This happens on Windows 10 64-bit with GLES3, on a GeForce GTX 1060 6GB/PCIe/SSE2.
There have been reports of this when using GLES2 as well.
Is it a driver issue, or is there something we forgot to do with the OpenGL API?

More details about the issue: Enabling verbose output makes `add_surface_from_arrays` 100 times faster · Issue #52801 · godotengine/godot · GitHub

One minor correction: “queuing” the upload takes 100 microseconds or 15 milliseconds. From CPU timings alone, you don’t know when the GL driver actually performs the underlying work you’ve queued.

Also, I wouldn’t expect dynamically creating buffer objects on the fly to be a fast operation you should put in your main rendering loop. This may very likely trigger an implicit sync, where the CPU waits on the GPU / back-end driver to catch up to a certain point in the command queue. That said…

I would suggest looking at this in Nsight Systems. With it, you can tell how much GL driver queue-ahead you’re getting at any given point. Be sure to add NVTX or KHR_debug markup (NVTX Ranges and KHR_debug DebugGroups, respectively), so you can tell exactly where you are on both the CPU and (if desired) the GPU timelines.
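
For instance, here’s a minimal sketch of both kinds of markup around the suspect region (assuming NVTX is available via nvToolsExt.h and you have a GL 4.3 / KHR_debug context, with entry points loaded by your usual GL loader):

	#include <nvToolsExt.h> // NVTX; link against the NVTX library

	void upload_with_markup() {
		nvtxRangePushA("mesh_upload"); // NVTX range: visible on the CPU timeline in Nsight Systems

		// KHR_debug group: visible in the GL command stream (length -1 = null-terminated).
		glPushDebugGroup(GL_DEBUG_SOURCE_APPLICATION, 1, -1, "mesh_upload");

		// ... the glGenBuffers / glBufferData calls go here ...

		glPopDebugGroup();
		nvtxRangePop();
	}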

I don’t know for sure, but my guess is that in the “debug flag” case you’re not getting much queue-ahead (the GPU and back-end driver aren’t far behind the CPU), whereas in the “normal” case you’re getting nearly one 60 Hz frame of queue-ahead (just under 16.6 msec) with VSync enabled.

So if buffer creation triggers an implicit sync, the first buffer object creation basically stalls your CPU draw thread until the GPU / back-end driver catches up to a specific point in the command queue and then unblocks it. That will take no time at all in the “debug flag” case (since the CPU isn’t far ahead), but nearly one 60 Hz frame in the “normal” case (since it’s queuing about one frame ahead). This is more likely if your app and/or the driver is limiting queue-ahead to one frame (explicitly or implicitly).

Anyway, if this is what’s going on, you’ll be able to see this in Nsight Systems.

NV driver configs and GL usage issues that affect whether and how much driver queue-ahead you get with NVIDIA’s OpenGL driver include (but aren’t limited to): full screen+focused vs. windowed, Low Latency Mode / Max Prerendered Frames, Triple Buffering on/off, Threaded Optimization, Monitor Technology, number of GPUs, etc. Of course, max queue-ahead can also be thwarted by your app implicitly (via ops that trigger implicit sync) or explicitly (via glFinish() or fence/sync operations). You may be hitting the former.
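
As a quick way to see how far ahead you’re actually queuing, you can poll a per-frame fence. A minimal sketch, assuming GL 3.2+ sync objects:

	// Drop a fence at the end of each frame; just before creating the next one,
	// poll last frame's fence. If it hasn't signaled yet, the GPU / back-end
	// driver is still at least a frame behind the CPU.
	GLsync g_frame_fence = nullptr;

	void end_of_frame() {
		if (g_frame_fence) {
			GLenum r = glClientWaitSync(g_frame_fence, 0, 0); // zero timeout: poll only
			bool gpu_caught_up = (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED);
			// log gpu_caught_up here to see how much queue-ahead you have
			glDeleteSync(g_frame_fence);
		}
		g_frame_fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
	}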

I’m aware of this, and I also see that the subsequent calls are very fast in comparison, with no relation to how complex they are (details in the linked thread). However, I did not expect a 15 ms wait on the first call into OpenGL; it’s so long and so systematic that it looked like something was wrong even for main-thread rendering. In fact, even if I don’t upload any mesh, the “stall” shifts to the first OpenGL call within the engine’s draw() routine. Perhaps that’s why other people had problems too and were able to “solve” their different issues with the “debug extension” workaround: in their cases it made perceived performance better.
So it may very well be OpenGL waiting for a sync point. Either I’m using the API wrong, or it’s just… normal, and I have to deal with it somehow (not many options, though).

One use case this causes problems for: I upload many meshes in quick succession, so I’d like to be able to measure dynamically how much can be uploaded within a given time budget. If I upload too much, it might error, overflow, or lag; if too little, updates will feel too slow because resources are underused.
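
Roughly what I have in mind (a sketch; pending_meshes and upload_mesh are hypothetical names, and this only budgets the CPU-side queuing time):

	#include <chrono>
	#include <queue>

	struct Mesh;
	void upload_mesh(const Mesh &m); // hypothetical: the glBufferData path above

	// Upload queued meshes until the per-frame CPU budget is spent;
	// whatever remains stays queued for the next frame.
	void process_uploads(std::queue<const Mesh *> &pending_meshes, std::chrono::microseconds budget) {
		using clock = std::chrono::steady_clock;
		const clock::time_point start = clock::now();
		while (!pending_meshes.empty() && (clock::now() - start) < budget) {
			upload_mesh(*pending_meshes.front());
			pending_meshes.pop();
		}
	}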

I wish I could do this on another thread. Godot 3 has multithreaded renderer support so that rendering can happen on another thread, but the way it is implemented still stalls the main thread. I’m told OpenGL cannot be used from multiple threads, so I can’t do this from a thread of my own. I’ve also had reports of multithreaded graphics support being buggy in Godot 3, so I haven’t focused on it much. The devs plan to improve it in Godot 4, mainly with Vulkan, so there will likely be more control over this.

The only glFinish I could find in Godot is done at the end of rendering under some conditions, and is not called during my test. I could not find any use of glFenceSync.

I installed Nsight Systems, included NVTX, and added an NVTX range around the region where the buffers are created in Godot’s renderer. Unfortunately, when I run from Nsight it seems to crash before even getting there: no rendering window opens, and the process just stops after a few seconds. If I run the same executable outside Nsight, it runs fine.

Hi Marc.

Ok, cool. I figured you did. But didn’t want to assume.

That makes sense.

That’s my guess. I’ve tripped over similar implicit sync issues before in NVIDIA’s GL driver (different cause, but similar result), nailed them down, and resolved them. It just takes a little patience. And in hindsight, it made sense why implicit sync was needed there. I find that the more you “think like the driver” (what would I have to do here if I were the driver dev supporting this GL call sequence?), know about the Vulkan method of doing it, and follow NVIDIA’s performance recommendations (“OpenGL like Vulkan” is a good start), the less you’ll trip over these implicit sync points. But it still happens. This is OpenGL, after all (a very deep driver with lots of voodoo magic going on in the abstraction).

As to options, I’ll mention a few below, but first…

Nsight Systems

I would suggest posting info on your problem in the Nsight Systems forum for the OS you’re hitting this on. The NVIDIA folks there have been very helpful to me in the past in diagnosing and fixing similar issues (likewise on the Nsight Graphics forum). The more specific the details you can provide (what error is produced, and what your app/engine is doing that seems to trigger it), the more likely they’ll be able to assist and fix the problem you’re hitting.

Toward that end, I’d first suggest that you:

  • Disable as much content and as many effects (GL calls) as you can to achieve a minimalist run config.
  • Retry Nsight Systems on your app.

If that gives you a config that works well in Nsight Systems, great! You can then repeat this, gradually flipping knobs back on to determine what GL usage is likely triggering the problem. It could be that the component causing the Nsight Systems issues is irrelevant to your implicit sync repro, in which case you could defer this crash investigation until later.

If you decide not to defer though, this procedure will give you specific info to provide to NVIDIA on what triggers the crash. And if it doesn’t give you a runnable config, then you’ve just carved away a huge amount of code that’s off-the-table, and the issue is in the remainder.

When I first started using Nsight Graphics and, later, Nsight Systems, I hit/tracked down/avoided a few similar crash issues. These stemmed from my app’s extensive use of the GL compatibility API and NVIDIA GL extensions, whereas the NVIDIA tools focus support on the GL core profile plus a selected subset of extensions. When in my app’s “Nsight profiling mode”, I disable use of a few extensions not well supported by the tools (e.g. for Nsight Graphics). In particular, related to profiling-tool crashes, I disable all use of NVIDIA’s awesome Bindless Graphics APIs (bindless buffer object access) when in “Nsight profiling mode”, falling back to the old bound-buffers method of submitting batches. This was important for stability in Nsight Graphics, as it did not support use of that extension. Note that bindless texture use is fine. And falling back from CSAA to MSAA textures for render targets was, though not required, useful for getting render-target preview working.

Alternatives

As far as that goes… This is your baby, so you can deal with it how you want. But if it were my problem to fix I’d:

  1. Determine exactly what GL call/call sequence is triggering this stall.
  2. Identify what alternatives exist that achieve the same rendered result (focusing on those most like Vulkan).
  3. Iteratively try those, prioritizing the ones likely to yield the highest perf and/or be least invasive to the code.

It sounds like you’ve done part of #1, but your mention of the stall shifting (to the 1st GL call in draw() when skipping the mesh buffer create+upload) suggests there might be more calls in this sequence, or other separate triggers of implicit sync.

Anyway, if you determine that it’s the buffer creation that’s triggering implicit sync (very possible), then I’d suggest reading up on buffer object streaming methods. That is:

  • get rid of buffer object creation and allocation at draw-time altogether, and
  • sub-allocate from portions of large, pre-allocated, “pool” buffer objects.

NVIDIA’s GL driver provides multiple methods to stream CPU data into and GPU render from these buffer objects efficiently, without implicit synchronization.
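
For illustration, one common shape of this is a persistently mapped pool. A sketch assuming GL 4.4 / ARB_buffer_storage (orphaning via glBufferData(..., nullptr, ...) plus glBufferSubData is another streaming option on older GL):

	#include <cstring>

	// One big immutable-storage vertex pool, created at init and mapped once.
	// Draw-time "uploads" are just memcpys + offsets: no glGenBuffers/glBufferData per mesh.
	struct VertexPool {
		GLuint buffer = 0;
		unsigned char *mapped = nullptr;
		GLsizeiptr capacity = 0;
		GLsizeiptr head = 0; // next free byte
	};

	void pool_init(VertexPool &p, GLsizeiptr capacity) {
		const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
		glGenBuffers(1, &p.buffer);
		glBindBuffer(GL_ARRAY_BUFFER, p.buffer);
		glBufferStorage(GL_ARRAY_BUFFER, capacity, nullptr, flags); // immutable storage
		p.mapped = static_cast<unsigned char *>(glMapBufferRange(GL_ARRAY_BUFFER, 0, capacity, flags));
		p.capacity = capacity;
	}

	// Returns the byte offset of the data within the pool, for use with
	// glVertexAttribPointer / glDrawElementsBaseVertex etc. A real version
	// must fence before reusing regions the GPU may still be reading from.
	GLsizeiptr pool_upload(VertexPool &p, const void *data, GLsizeiptr size) {
		if (p.head + size > p.capacity)
			p.head = 0; // wrap: only safe once a fence confirms the GPU is done here
		std::memcpy(p.mapped + p.head, data, size);
		const GLsizeiptr offset = p.head;
		p.head += size;
		return offset;
	}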

Good luck!