PCI Express bus bandwidth and behaviour: unexplained peaks in PCI Express bandwidth

Hi there,

I recently modified my Stream benchmark program to probe the PCI Express bus and measure its sustainable bandwidth with CUDA.
You can see my results in the attached picture (or visit my blog).

The first two graphs show the performance when writing to the device and reading from it.
The other graphs show basically the same measurements, but with the data split across multiple cudaMemcpy calls instead of a single one, e.g. 1 MB or 4 MB, up to 32 MB.
Only the most interesting and fastest variants are plotted.
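For reference, the chunked variant can be sketched roughly as follows. This is a minimal sketch and not my exact benchmark code: the helper name `copyChunked`, the use of pageable host memory from `malloc`, the CUDA event timing, and the 32 MB total size are all assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original benchmark): copies `total` bytes
// from host to device in pieces of at most `chunk` bytes and returns the
// sustained bandwidth in MB/s, with 1 MB = 2^20 bytes.
static float copyChunked(char *dst, const char *src, size_t total, size_t chunk)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (size_t off = 0; off < total; off += chunk) {
        // The last piece may be shorter than `chunk`.
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpy(dst + off, src + off, n, cudaMemcpyHostToDevice);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return (total / (1024.0f * 1024.0f)) / (ms / 1000.0f);
}

int main()
{
    const size_t MB = 1 << 20;
    const size_t total = 32 * MB;  // assumed total size, for illustration only

    char *h = (char *)malloc(total);  // pageable host memory (assumption)
    char *d = NULL;
    if (h == NULL || cudaMalloc((void **)&d, total) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    memset(h, 0, total);

    copyChunked(d, h, total, total);  // warm-up run, result discarded

    printf("single 4 MB copy:      %.1f MB/s\n", copyChunked(d, h, 4 * MB, 4 * MB));
    printf("32 MB in 4 MB chunks:  %.1f MB/s\n", copyChunked(d, h, total, 4 * MB));
    printf("single 32 MB copy:     %.1f MB/s\n", copyChunked(d, h, total, total));

    cudaFree(d);
    free(h);
    return 0;
}
```

The first two printed lines correspond to the two cases I am comparing: one 4 MB transfer on its own versus a larger transfer made up of 4 MB pieces. The sketch omits per-call error checking and averaging over repeated runs, both of which a real measurement would need.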
My question now is:
Why is there a gap between the version that transfers 4 MB in a single, unsplit copy and the implementation that transfers larger sizes in 4 MB packets?
There should only be a minor overhead from the additional loop iterations and so on.

My guess is some kind of pipelining or caching or handshaking issue.

The other question is more out of curiosity.
Does anyone know exactly what protocol change happens when going from 1e6 bytes to 2e6 bytes?

Thanks!