simpleMultiGPU performance: CUDA 4.0 vs 3.2

Here is how the SDK example simpleMultiGPU performs on 3.2 and 4.0:

On 3.2, performance is worse with pinned, write-combined memory than with plain malloc.

On 4.0, it is the opposite: performance is better with pinned, write-combined memory than with plain malloc.
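For reference, the two host allocation variants being compared look roughly like the sketch below. This is a minimal standalone illustration, not the SDK code: the helper `timeH2D` and the 64 MiB buffer size are placeholders of my own, and it only times a single host-to-device copy rather than the full multi-GPU run.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the SDK sample): time one
// host-to-device copy of `bytes` bytes using CUDA events.
static float timeH2D(void *dst, const void *src, size_t bytes) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, arbitrary size for illustration

    void *d_buf = NULL;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // Variant 1: ordinary pageable host memory.
    void *h_pageable = malloc(bytes);
    if (h_pageable == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    memset(h_pageable, 0, bytes);

    // Variant 2: pinned (page-locked) host memory with the write-combined flag.
    void *h_pinned = NULL;
    if (cudaHostAlloc(&h_pinned, bytes, cudaHostAllocWriteCombined) != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed\n");
        return 1;
    }
    memset(h_pinned, 0, bytes);  // writing to write-combined memory is fine; reading it from the CPU is slow

    printf("pageable (malloc):       %.3f ms\n", timeH2D(d_buf, h_pageable, bytes));
    printf("pinned + write-combined: %.3f ms\n", timeH2D(d_buf, h_pinned, bytes));

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```

Building this with nvcc under each toolkit version should show whether the difference already appears in a plain copy or only in the full sample.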

Why does the result flip between the two versions?

Thanks.
