Blitting from a multisampled FBO to a multisampled default framebuffer is very slow

While trying to solve a performance problem in my program, I found that the most time-consuming step is blitting the color and depth buffers from an FBO to the default framebuffer. When both the FBO and the default framebuffer are 1280x720 with 8x MSAA, the blit takes about 25 ms on a GT 640. If the blit target is another multisampled FBO instead of the default framebuffer, it takes no more than 6 ms.
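
For a sense of scale, here is a rough estimate of how much data one such blit moves and what copy rate the two timings imply. The pixel formats are an assumption (4 bytes per sample for an RGBA8 color buffer plus 4 bytes per sample for a 32-bit depth or depth/stencil buffer), and the helper names `blitBytes` and `gigabytesPerSecond` are made up for this sketch. It ignores any framebuffer compression the hardware may apply, so treat the results as an upper bound on the data size, not a measurement.

```cpp
#include <cstdint>

// Bytes covered by one full-buffer copy.
// bytesPerSample is an ASSUMPTION: 4 (RGBA8 color) + 4 (32-bit depth) = 8.
constexpr std::uint64_t blitBytes(std::uint64_t width, std::uint64_t height,
                                  std::uint64_t samples,
                                  std::uint64_t bytesPerSample) {
    return width * height * samples * bytesPerSample;
}

// Effective copy rate in GB/s (10^9 bytes per second) for a given duration.
constexpr double gigabytesPerSecond(std::uint64_t bytes, double milliseconds) {
    return static_cast<double>(bytes) / (milliseconds * 1e-3) / 1e9;
}
```

With these assumptions, `blitBytes(1280, 720, 8, 8)` is 58,982,400 bytes (about 59 MB). Passing that to `gigabytesPerSecond` gives roughly 2.4 GB/s for the 25 ms case and roughly 9.8 GB/s for the 6 ms case, counting only the bytes written.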

Using a fragment shader to copy two multisampled textures (color and depth) to the default framebuffer doesn’t have this problem. It’s almost as fast as FBO -> FBO blitting.
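
A minimal sketch of what such a copy shader can look like, stored as a C++ string as it might appear in a test program. This is not necessarily the shader in the attachment: the constant name `kCopyFragmentShader` and the uniform names `uColor` and `uDepth` are illustrative, and it assumes GLSL 4.00 so that `gl_SampleID` is available without an extension.

```cpp
#include <string>

// Hypothetical fragment shader for a per-sample copy of a multisampled
// color texture and a multisampled depth texture into the bound framebuffer.
// Uniform names are illustrative; the attached buffercopy.cpp may differ.
const std::string kCopyFragmentShader = R"GLSL(
#version 400 core

uniform sampler2DMS uColor;  // multisampled color texture
uniform sampler2DMS uDepth;  // multisampled depth texture

out vec4 fragColor;

void main() {
    ivec2 texel = ivec2(gl_FragCoord.xy);
    // Reading gl_SampleID makes the shader run once per sample, so each
    // destination sample receives the matching source sample.
    fragColor    = texelFetch(uColor, texel, gl_SampleID);
    gl_FragDepth = texelFetch(uDepth, texel, gl_SampleID).r;
}
)GLSL";
```

The shader would be run by drawing a full-screen triangle or quad. For the `gl_FragDepth` write to land in the depth buffer, the depth test has to be enabled with depth writes on, for example with `glDepthFunc(GL_ALWAYS)`.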

Can anyone explain why multisampled “FBO -> default framebuffer” blitting is so slow?

My driver version is 347.09, on 64-bit Windows 7.

Edit: Uploaded test code as an attachment. By default it tests FBO -> default framebuffer blitting. If the TEX_SRC macro is defined, it tests texture -> default framebuffer copying with a fragment shader.
buffercopy.cpp (7.38 KB)