Extremely slow glDrawElementsInstanced compared to glDrawArraysInstanced

I’m rendering 1 million instanced quads using glDrawElementsInstanced. The function call itself takes ~11 ms on the CPU. With glDrawArraysInstanced the call takes only ~2 ms, and with glMultiDrawElementsIndirect only 0.002 ms. The time spent in glDrawElementsInstanced scales linearly with the number of instances.
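For anyone who wants to compare the two indexed paths without opening the source, here is a minimal sketch of how the same draw could be expressed as a single indirect command. The struct layout is the one the OpenGL specification defines for glMultiDrawElementsIndirect. The helper name `makeQuadCommand`, the 6 indices per quad, and the `GL_UNSIGNED_INT` index type are assumptions for illustration and may not match the linked reproduction.

```cpp
#include <cassert>
#include <cstdint>

// Command layout read by glMultiDrawElementsIndirect (per the OpenGL spec).
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices per instance
    std::uint32_t instanceCount; // number of instances to draw
    std::uint32_t firstIndex;    // offset into the index buffer, in indices
    std::int32_t  baseVertex;    // value added to every fetched index
    std::uint32_t baseInstance;  // first instance ID for instanced attributes
};

// Five tightly packed 32-bit fields; the driver expects exactly this layout.
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "indirect command must be 20 bytes");

// Hypothetical helper: one command that draws `instances` quads,
// assuming each quad is two triangles built from 6 indices.
DrawElementsIndirectCommand makeQuadCommand(std::uint32_t instances) {
    DrawElementsIndirectCommand cmd{};
    cmd.count = 6;
    cmd.instanceCount = instances;
    cmd.firstIndex = 0;
    cmd.baseVertex = 0;
    cmd.baseInstance = 0;
    return cmd;
}

// With the command uploaded to a buffer bound to GL_DRAW_INDIRECT_BUFFER,
// these two calls request the same geometry (index type is an assumption):
//   glDrawElementsInstanced(GL_TRIANGLES, 6, GL_UNSIGNED_INT, nullptr, 1000000);
//   glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, 1, 0);
```

Both calls go through the same index buffer and the same instance count, so they are a like-for-like pair when timing the CPU cost of the call itself.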

Simple exe with reproduction: https://drive.google.com/open?id=1zzKl0Mw4CliaktEp4GYaVJJgrH2M46x-
Source code: https://drive.google.com/open?id=1aJew9EqPN1foODvTUqQluEMn7FB4BlIb

Tested on an i7-4790, 32 GB RAM, GTX 960, driver 430.86, on both Windows 8.1 and Windows 7.

Related issue: https://devtalk.nvidia.com/default/topic/548150/opengl/performance-bug-in-gldrawelementsinstanced/

All cases are fast on AMD cards.