Extremely slow glDrawElementsInstanced compared to glDrawArraysInstanced

I’m rendering 1 million instanced quads using glDrawElementsInstanced. The function call itself takes ~11 ms on the CPU. With glDrawArraysInstanced it takes only ~2 ms, and with glMultiDrawElementsIndirect only ~0.002 ms. The glDrawElementsInstanced time scales linearly with the number of instances.
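For reference, here is a minimal sketch of the three submissions being compared. It assumes each quad is drawn as two triangles with 6 indices of type GL_UNSIGNED_INT, which may not match the attached source exactly. The struct follows the command layout that glMultiDrawElementsIndirect reads from the buffer bound to GL_DRAW_INDIRECT_BUFFER. The names `kQuadIndices` and `makeQuadCommand` are hypothetical, and the GL calls appear only as comments so the snippet compiles without a context.

```cpp
#include <cassert>
#include <cstdint>

// Command layout read by glMultiDrawElementsIndirect from the buffer bound
// to GL_DRAW_INDIRECT_BUFFER: five tightly packed 32-bit fields.
struct DrawElementsIndirectCommand {
    std::uint32_t count;         // number of indices per instance
    std::uint32_t instanceCount; // number of instances to draw
    std::uint32_t firstIndex;    // offset into the index buffer, in indices
    std::int32_t  baseVertex;    // value added to every index
    std::uint32_t baseInstance;  // first instance for instanced attributes
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "indirect command must be tightly packed");

// Assumed quad layout: 4 vertices, two triangles, 6 indices.
constexpr std::uint32_t kQuadIndices[6] = {0, 1, 2, 2, 3, 0};

// Hypothetical helper: one indirect command that draws `instanceCount` quads.
inline DrawElementsIndirectCommand makeQuadCommand(std::uint32_t instanceCount) {
    assert(instanceCount > 0);
    return DrawElementsIndirectCommand{6u, instanceCount, 0u, 0, 0u};
}

// The three calls, for N instances (VAO, index buffer and indirect buffer
// already bound):
//   glDrawElementsInstanced(GL_TRIANGLES, 6, GL_UNSIGNED_INT, nullptr, N);
//   glDrawArraysInstanced(GL_TRIANGLES, 0, 6, N);  // 6 unindexed vertices
//   glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, 1, 0);
```

With `makeQuadCommand(1000000)` uploaded to the indirect buffer, the last call describes the same draw as the first one, so the CPU-side difference is in how the driver handles the submission rather than in the amount of geometry.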

Simple exe with reproduction: slow_exe.zip - Google Drive
Source code: slow_src.zip - Google Drive

Tested on an i7-4790, 32 GB RAM, GTX 960, driver 430.86, on Windows 8.1 and Windows 7.

Related issue: https://devtalk.nvidia.com/default/topic/548150/opengl/performance-bug-in-gldrawelementsinstanced/

Edit:
All cases are fast on AMD cards.