Bandwidth disparity between Host-Device-Device-Host

I have attached the bandwidth test results for an NVIDIA Tesla C2050 (I measured DtoD with a slightly modified simpleP2P sample). My question is: why is the inter-GPU (DtoD) bandwidth lower than, or at best close to, the DtoH and HtoD bandwidth? My guess is that cudaMemcpy is internally doing a Device-Host-Device transfer instead of a direct Device-Device one.
My testbed is a node comprising dual-socket six-core Westmere CPUs and four NVIDIA Tesla C2050 GPUs, each on a PCIe x16 Gen2 bus.
(Attachment: bw_dhhddd_1.png — bandwidth plot)

Are you using CUDA 4.0?
I think they added GPUDirect peer-to-peer support, which means that two GPUs can share data over the PCIe bus without having to copy the data to the host first.
Of course the transfer still goes over PCIe, so that is why you don't get better bandwidth than D2H.
Someone at NVIDIA please confirm this, but I am pretty certain this is the case.

Apostolis

Yes, you're correct, I'm using CUDA 4.0, and yes again, the intention was to avoid the CPU in the data transfer, hence I was expecting at least equivalent, if not better, bandwidth. But you'll notice in the overall D2D plot that up to 16 MB the bandwidth is lower than both D2H and H2D.
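For what it's worth, peer-to-peer copies only bypass host staging when peer access has been explicitly enabled on both devices; otherwise cudaMemcpyPeer silently falls back to a staged Device-Host-Device path, which would match the lower bandwidth you're seeing. A minimal sketch of the check-and-enable sequence (device IDs 0 and 1 and the 16 MB size are assumptions for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int canAccess01 = 0, canAccess10 = 0;
    // Check whether the topology allows direct peer access in each direction.
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("peer access 0->1: %d, 1->0: %d\n", canAccess01, canAccess10);

    if (canAccess01 && canAccess10) {
        // Peer access must be enabled separately on each device.
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);   // second argument (flags) must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }

    const size_t bytes = 16u << 20;         // 16 MB, near the knee in the plot
    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0); cudaMalloc(&src, bytes);
    cudaSetDevice(1); cudaMalloc(&dst, bytes);

    // With peer access enabled this is a direct GPU-to-GPU DMA over PCIe;
    // without it, the driver stages the copy through host memory.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    cudaSetDevice(0); cudaFree(src);
    cudaSetDevice(1); cudaFree(dst);
    return 0;
}
```

Note that even with peer access enabled, both GPUs must sit behind the same PCIe root complex for direct P2P; on a dual-socket Westmere board, devices attached to different sockets' IOHs cannot do direct P2P and the driver stages through the host.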