Bandwidth disparity between Host-Device-Device-Host

I have attached the bandwidth test results for an NVIDIA Tesla C2050 (I measured DtoD with a slightly modified simpleP2P sample). My question is: why is the inter-GPU (DtoD) bandwidth lower than, or at best close to, the DtoH and HtoD bandwidth? My guess is that cudaMemcpy is internally doing a Device-Host-Device transfer instead of a direct Device-Device one.
My testbed is a node comprising dual-socket six-core Westmere CPUs and four NVIDIA Tesla C2050 GPUs, each on a PCIe x16 Gen2 bus.
(Attachment: bw_dhhddd_1.png — bandwidth plot)

Are you using CUDA 4.0?
I think they added GPUDirect peer-to-peer support, which means that two GPUs can share data over the PCIe bus without having to copy the data to the host first.
Of course the transfer still goes over PCIe, so that is why you don't get better bandwidth than D2H.
Someone at NVIDIA please confirm this, but I am pretty certain this is the case.

Apostolis

Yes, you're correct, I'm using CUDA 4.0, and yes again, the intention was to avoid the CPU in the data transfer, hence I was expecting at least equivalent, if not better, bandwidth. But you'll notice in the overall D2D plot that up to 16 MB the bandwidth is lower than both D2H and H2D.
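For what it's worth, peer-to-peer copies only bypass host staging when peer access has been explicitly enabled on both devices; otherwise cudaMemcpyPeer silently falls back to a staged Device-Host-Device path, which would match the lower bandwidth you're seeing. A minimal sketch of the check-and-enable sequence (device IDs 0 and 1 and the 16 MB size are assumptions for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess01 = 0, canAccess10 = 0;
    // Check whether the topology allows direct peer access in each direction.
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("peer access 0->1: %d, 1->0: %d\n", canAccess01, canAccess10);

    if (canAccess01 && canAccess10) {
        // Peer access must be enabled separately on each device.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }

    const size_t bytes = 16u << 20;         // 16 MB, near the knee in the plot
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    // With peer access enabled this is a direct GPU-to-GPU DMA over PCIe;
    // without it, the driver stages the copy through host memory.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    cudaSetDevice(0); cudaFree(src);
    cudaSetDevice(1); cudaFree(dst);
    return 0;
}
```

Note that even with peer access enabled, both GPUs must sit behind the same PCIe root complex for direct P2P; on a dual-socket Westmere board, devices attached to different sockets' IOHs cannot do direct P2P and the driver stages through the host.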