Performance bug in glDrawElementsInstanced

Hello. I’ve discovered a bug that presumably affects all Nvidia cards. glDrawElementsInstanced() is extremely slow, but glDrawArraysInstanced() is fine. I’ve created a small test program that reproduces the problem. The program renders 524 288 small 1-pixel quads over a window.

  • My pseudo-instancing method manages 93 FPS (uses glDrawElements() and draws 512 instances per batch).
  • Rendering 512 instances per batch using glDrawArraysInstanced() gives me 43 FPS, which is decent.
  • Rendering 512 instances per batch using glDrawElementsInstanced() gives me an abysmal 10 FPS.
  • Rendering all instances using a single call to glDrawArraysInstanced() gives me 104 FPS as expected!
  • Rendering all instances using a single call to glDrawElementsInstanced() gives me 11 FPS.
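
To put the figures above in perspective, the following is a minimal sketch that converts the reported FPS values into frame times and a slowdown ratio. The class and method names are hypothetical and not part of the test program; only the FPS values and the quad count come from the post.

```java
// Hypothetical helper, not from the test program: converts the FPS figures
// reported above into per-frame and per-quad costs.
final class FrameTimes {
    // Number of quads rendered per frame, as stated in the post.
    static final int NUM_QUADS = 524_288;

    // Milliseconds spent on one frame at the given frame rate.
    static double frameMillis(double fps) {
        return 1000.0 / fps;
    }

    // Average nanoseconds spent per quad at the given frame rate.
    static double nanosPerQuad(double fps) {
        return 1e9 / fps / NUM_QUADS;
    }

    // How many times slower the slow path is than the fast path.
    static double slowdown(double fastFps, double slowFps) {
        return fastFps / slowFps;
    }

    public static void main(String[] args) {
        String[] names = {
            "pseudo-instancing (glDrawElements)",
            "glDrawArraysInstanced, 512 per batch",
            "glDrawElementsInstanced, 512 per batch",
            "glDrawArraysInstanced, single call",
            "glDrawElementsInstanced, single call"
        };
        double[] fps = {93, 43, 10, 104, 11};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-40s %7.2f ms/frame %7.1f ns/quad%n",
                    names[i], frameMillis(fps[i]), nanosPerQuad(fps[i]));
        }
        System.out.printf("single-call slowdown: %.2fx%n", slowdown(104, 11));
    }
}
```

Comparing frame times rather than FPS makes the gap easier to reason about, since the cost added per frame by the indexed path is then a simple difference.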

The test can be found here:

  • Run by starting run.bat. You need to have Java installed. If it can’t find Java, try to hardcode the path to java.exe in the bat-file.
  • The program pops up an option box where you can choose one of the 5 modes above. The pseudo-instancing mode and the pure instancing modes are the interesting ones.
  • FPS is printed to the console every second.

The source can be found here: and requires LWJGL for OpenGL access. Shader source can be found in the shaders/ directory that comes with the test program. I’d like to point out that to go from 104 FPS to 11 FPS, all you need to do is to change a single line from glDrawArraysInstanced() to glDrawElementsInstanced():

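if(mode == 3){
    // 104 FPS: non-indexed quads, all instances in a single call
    glDrawArraysInstanced(GL_QUADS, 0, 4, NUM_QUADS);
}else{
    // 11 FPS: indexed triangles, all instances in a single call
    glDrawElementsInstanced(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, 0, NUM_QUADS);
}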

The bug has been confirmed on all cards I’ve managed to test it on:
GTX 295 (both with SLI on and off)
GTX 460m (laptop)
GTX 460
GT 630
GTX 680

This bug is seriously affecting my game’s performance: the environment is rendered using 8000 instances of a few meshes, which forces me to use a much less efficient, Nvidia-specific batching workaround to avoid the heavy performance hit.

Original post on

That is not really a bug.

Do NOT use GL_QUADS! Although the drivers surprisingly aren’t doing a very good job at breaking them up in triangles, you really should always be using triangles.

This IS a bug, and it’s still there. Drawing a quad using GL_QUADS and glDrawArraysInstanced() is 9.5x as fast as drawing 2 triangles using GL_TRIANGLES to form a quad using glDrawElementsInstanced(). I repeat, GL_QUADS is almost 10x FASTER than triangles since they use glDrawArraysInstanced() instead of glDrawElementsInstanced(). Please reread my post.

Also, the GPU load reported by GPU-Z is much higher when using glDrawElementsInstanced() compared to batching (duplicating the data X times in a VBO so I can draw X “instances” at a time using a standard glDrawElements() call):

Batching: 108 FPS, 56% GPU load, 26% memory controller load
Instancing: 83 FPS, 60% GPU load, 21% memory controller load.
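
The batching workaround described above duplicates the mesh X times in a VBO, so the index buffer has to be duplicated too, with each copy offset by the number of vertices per mesh. Below is a minimal sketch of that index generation, assuming each “instance” is a quad with 4 vertices and 6 indices as in the test program. The class name, method name, and winding order are assumptions for illustration, not taken from the original source.

```java
// Hypothetical sketch of index generation for the batching workaround:
// one quad pattern repeated 'copies' times, offset by 4 vertices per copy.
final class BatchedQuadIndices {
    // Two triangles per quad; this winding order is an assumption.
    private static final int[] QUAD_PATTERN = {0, 1, 2, 2, 3, 0};

    static short[] build(int copies) {
        // GL_UNSIGNED_SHORT indices can address at most 65536 vertices.
        if (copies < 0 || copies * 4 > 65536) {
            throw new IllegalArgumentException("copies out of range: " + copies);
        }
        short[] indices = new short[copies * QUAD_PATTERN.length];
        for (int quad = 0; quad < copies; quad++) {
            int baseVertex = quad * 4;
            for (int i = 0; i < QUAD_PATTERN.length; i++) {
                // The cast wraps values above 32767; read as unsigned they are correct.
                indices[quad * QUAD_PATTERN.length + i] = (short) (baseVertex + QUAD_PATTERN[i]);
            }
        }
        return indices;
    }

    public static void main(String[] args) {
        short[] indices = build(512); // 512 "instances" per batch, as in the post
        System.out.println("index count: " + indices.length);
        for (int i = 0; i < 12; i++) {
            System.out.print((indices[i] & 0xFFFF) + " ");
        }
        System.out.println();
    }
}
```

With 16-bit indices a batch is limited to 16384 quads, which is one reason the batch size in such a workaround is usually kept small.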

If I inflate the amount of terrain rendering (which uses instancing) I get the following values:

Batching: 35 FPS, 50% GPU load, 11% memory controller load
Instancing: 14 FPS, 58% GPU load, 6% memory controller load

If I run the same test on a computer with a weaker GPU, I get the following results:

Batching: 107 FPS, 97% GPU load, 50% memory controller load
Instancing: 70 FPS, 82% GPU load, 33% memory controller load

Please note that both of these tests are heavily GPU bottlenecked. They run at exactly the same FPS regardless of whether the CPU clock is at 3.9GHz or at 2.5GHz. Note in particular how the GPU is unable to hit 100% load before bottlenecking when using instancing!

This is using a real world test which does NOT heavily rely on instancing in the first place, hence the performance hit is not as severe as in the test above, but still very noteworthy! Please look into this!!!


I managed to reproduce this as well, with 2 million particles.
Positions were fixed, and the only difference between the two tests was the same as yours (glDrawElementsInstanced vs glDrawArrays with GL_QUADS).
I got more than double the performance with GL_QUADS.
I could imagine that this is because of the way we’ve set up the indices for our quads; maybe they end up thrashing some cache in there? I am not sure. It was an interesting find, though.
I will also upload my code to GitHub or somewhere similar soon.

Thank you a lot for testing!

It could be that some cache is flushed between instances when indices are used, but cache misses caused by the index order are not the problem. You can confirm this by using quads and a “pass through” index buffer (simply containing the numbers 0 to 65535). It doesn’t matter what primitive you render. Simply using any index buffer at all kills performance.
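
For anyone who wants to repeat that check, here is a minimal sketch of how such a “pass through” index buffer could be built. It uses plain java.nio so that it runs without an OpenGL context; the class and method names are hypothetical. In the real test, the resulting buffer would be uploaded to a GL_ELEMENT_ARRAY_BUFFER and drawn with GL_UNSIGNED_SHORT indices.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.ShortBuffer;

// Hypothetical sketch: an index buffer where index i simply refers to vertex i,
// so indexed rendering fetches vertices in exactly the order non-indexed rendering would.
final class PassThroughIndices {
    static ShortBuffer build(int count) {
        // 16-bit indices can only represent the values 0 to 65535.
        if (count < 0 || count > 65536) {
            throw new IllegalArgumentException("count out of range: " + count);
        }
        // Direct, native-order buffer, which is what OpenGL bindings expect.
        ShortBuffer buffer = ByteBuffer.allocateDirect(count * Short.BYTES)
                .order(ByteOrder.nativeOrder())
                .asShortBuffer();
        for (int i = 0; i < count; i++) {
            // Values above 32767 wrap to negative shorts but are read as unsigned by the GPU.
            buffer.put((short) i);
        }
        buffer.flip(); // prepare the buffer for reading
        return buffer;
    }

    public static void main(String[] args) {
        ShortBuffer indices = build(65536);
        System.out.println("indices: " + indices.remaining());
        System.out.println("last index (unsigned): " + (indices.get(65535) & 0xFFFF));
    }
}
```

Because every index equals its own position, any remaining slowdown cannot be attributed to a poor vertex access pattern.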