glMapBufferRange overhead


In my game, I need to stream all vertices to the GPU. Streaming bandwidth is fine, but each call to glMapBufferRange(GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT) costs about 20 µs just in setup time. The usage hints of glBufferData don’t seem to matter either, so I chose GL_STREAM_DRAW.

As I have to map this buffer up to 10k times per frame, this slows my game down heavily.

Is there a faster way to upload small changes to buffers? I looked at cudaHostAlloc and GL_AMD_pinned_memory, but neither is available for NVIDIA OpenGL.

glBufferSubData seems to be much faster, but it would require an additional memcpy.

Thanks,
Markus

It’s me again.

I found a dirty hack as a workaround for this problem: create a streaming buffer (GL_STREAM_DRAW), map it once, and just keep using that pointer. This does exactly what I want. But this usage is forbidden by GL, so I’d have to verify it on every system.

For syncing my ring buffer, I use glClientWaitSync. It seems that this call flushes the pipeline instead of just checking the fence. Is this common?
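For context, the usual pattern is to split the ring buffer into a few regions, insert a fence after finishing each region, and only wait on the oldest fence when wrapping back around to it. A minimal sketch of the bookkeeping side (sizes and names here are made up for illustration; the actual glFenceSync/glClientWaitSync calls are only indicated in comments):

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE    (4u * 1024u * 1024u)   /* hypothetical total ring size */
#define NUM_REGIONS  4u
#define REGION_SIZE  (RING_SIZE / NUM_REGIONS)

typedef struct {
    size_t head;  /* next free byte in the ring */
} Ring;

/* Returns the offset to write `bytes` at and advances the head.
 * Allocations never straddle the wrap point. When the head enters a
 * new region, the caller would glClientWaitSync() the fence pending
 * for that region (if any) and glFenceSync() the region just left. */
size_t ring_alloc(Ring *r, size_t bytes)
{
    assert(bytes <= REGION_SIZE);
    if (r->head + bytes > RING_SIZE)  /* wrap around */
        r->head = 0;
    size_t off = r->head;
    r->head += bytes;
    return off;
}

/* Which fence region does an offset belong to? */
unsigned ring_region(size_t offset)
{
    return (unsigned)(offset / REGION_SIZE);
}
```

With enough regions in flight, the wait on the oldest fence should almost always return immediately, so the cost of glClientWaitSync is amortized over many small uploads.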


I can’t let this go uncommented. Relying on implementation-dependent behavior is doomed to fail. Don’t do that.
You cannot guarantee that another OS, GPU, driver update, or even a different memory load won’t change that mapping. If that happens after a driver update, people will blame the driver vendor first, but it would be your fault.

That is the real problem. Game engines don’t normally do that. You should think about legal ways to re-architect your data management so you don’t need 10,000 map calls per frame.

You could, for example, keep a host mirror of that data, track every write access to it as a range (or a number of ranges), then map the whole buffer once and use a few sub-buffer calls to update the thing.
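To make that concrete, here is a minimal sketch of such dirty-range tracking (all names and the fixed capacity are made up for illustration): each CPU write to the mirror marks a byte range, overlapping or touching ranges are merged, and at frame end one upload call is issued per remaining range.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RANGES 64

typedef struct { size_t begin, end; } Range;   /* half-open [begin, end) */

typedef struct {
    Range  r[MAX_RANGES];
    size_t count;
} DirtyList;

/* Mark [begin, end) as dirty, growing any existing range it
 * overlaps or touches. (A full implementation would also re-merge
 * ranges that the new span now connects; omitted for brevity.) */
void mark_dirty(DirtyList *d, size_t begin, size_t end)
{
    for (size_t i = 0; i < d->count; i++) {
        if (begin <= d->r[i].end && end >= d->r[i].begin) {
            if (begin < d->r[i].begin) d->r[i].begin = begin;
            if (end   > d->r[i].end)   d->r[i].end   = end;
            return;
        }
    }
    assert(d->count < MAX_RANGES);
    d->r[d->count].begin = begin;
    d->r[d->count].end   = end;
    d->count++;
}
```

At frame end you would loop over the list and call something like glBufferSubData(target, r.begin, r.end - r.begin, mirror + r.begin) per range, then reset count to 0 — turning thousands of tiny uploads into a handful of larger ones.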

You didn’t say how big that buffer is or how expensive it is to render one frame. Maybe it’s even feasible to keep two complete buffers on the GPU and let it render one while you fill up the next with a single glBufferData call.

That’s exactly why I asked about AMD_pinned_memory. This “hack” is only there to measure the performance I might gain.
I will never release it this way.

I’m sorry, the internal data management can’t be touched, as I’m developing an emulator. I could buffer vertices before uploading and drawing them, but then I’d also have to buffer all draw calls and uniform/state changes, which would require a lot of work. Without uniform/state changes, I already use a host mirror.
As the emulated platform is a shared-memory system, state changes are free and used often. That’s the reason for the high number of draw and upload calls. Most of these draw calls use at most 6 vertices.

But as expected for emulators, rendering one frame isn’t expensive at all.

Long trial and error shows that NVIDIA has less overhead in glBufferSubData but performance issues when updating big buffers, AMD wins with the pinned-memory extension, and Intel has the least mapping overhead.

So, I finally got it. With ARB_buffer_storage, I can do what I wanted to do :-)
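For anyone finding this thread later, the legal version of the “map once, keep the pointer” hack looks roughly like this with ARB_buffer_storage (an untested sketch; `buf` and RING_SIZE are placeholders, and you still need fences before reusing a region):

```c
/* Create an immutable buffer that stays mapped for its whole lifetime. */
GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBindBuffer(GL_ARRAY_BUFFER, buf);
glBufferStorage(GL_ARRAY_BUFFER, RING_SIZE, NULL, flags);
void *ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, RING_SIZE, flags);

/* Per draw: write vertices through ptr + offset, draw from that offset.
 * Insert a glFenceSync per ring region and glClientWaitSync it before
 * overwriting that region again. Never call glUnmapBuffer until the
 * buffer is destroyed. */
```

With GL_MAP_COHERENT_BIT the writes become visible to the GPU without explicit flushes; without it you would use GL_MAP_FLUSH_EXPLICIT_BIT and glFlushMappedBufferRange instead.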

By the way, this extension is only available on Fermi and Kepler GPUs. Is there a hard requirement that Tesla doesn’t meet? I thought Tesla also has DMA support.