Transfer rates in Multi-GPU

Hi All,
I have an i7 980X CPU connected to a Tesla S1070 (which has 4 GPUs). I'm transferring data from the CPU to each GPU at the same time and trying to measure the bandwidth I can get between the CPU and each individual GPU.

The transfer rates obtained between the CPU and the 4 GPUs are 0.772456, 0.764574, 2.54562 and 2.5455 GB/s.

But when I transfer data from the CPU to just one GPU, the transfer rate obtained is 1.56321 GB/s.

I see that

  1. when I transfer data from the CPU to all GPUs at the same time, the aggregate transfer rate is almost 4 * (the transfer rate between the CPU and one GPU): 0.77 + 0.76 + 2.55 + 2.55 ≈ 6.6 GB/s, versus 1.56 GB/s for a single transfer.
  2. the transfer rate between the CPU and an individual GPU during a simultaneous transfer to all GPUs can be higher than the transfer rate from the CPU to a single GPU (2.55 GB/s versus 1.56 GB/s).

Are my observations correct…?

For the above experiment, I proceeded as follows.

I associated one CPU thread with each GPU (Tesla C1060), and the pseudocode for the thread is shown below.

pthread_barrier_wait(&barrier);   // to start all the transfers at the same time
CUT_SAFE_CALL(cutStartTimer(timer));
cudaMemcpy(...);                  // host-to-device copy
CUT_SAFE_CALL(cutStopTimer(timer));
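Spelled out in full, each worker thread does roughly the following. This is only a sketch: the 64 MB buffer size and the variable names are illustrative, and it times the copy with CUDA events instead of the cutil timer I actually used.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define NGPUS  4
#define NBYTES (64 * 1024 * 1024)        /* 64 MB per transfer (illustrative) */

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);                  /* bind this host thread to one GPU */

    void *h_buf = malloc(NBYTES);        /* pageable host memory, as in my runs */
    void *d_buf = NULL;
    cudaMalloc(&d_buf, NBYTES);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    pthread_barrier_wait(&barrier);      /* start all four copies at the same time */

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, NBYTES, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU %d: %.3f GB/s\n", dev, (NBYTES / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return NULL;
}

int main(void)
{
    pthread_t tid[NGPUS];
    int       id[NGPUS];

    pthread_barrier_init(&barrier, NULL, NGPUS);
    for (int i = 0; i < NGPUS; ++i) {
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NGPUS; ++i)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}

Bandwidth is then just the number of bytes copied divided by the elapsed time.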

Regards,
M. Kiran Kumar.

Make sure that you’re using pinned memory.

  1. All transfer rates you’re quoting are quite low.

  2. Generally speaking, what matters here is not so much the CPU as the motherboard and memory.

You must have an X58-based motherboard. X58 can, in theory, sustain two uploads to two different GPUs in parallel. The caveat is that, on some motherboards, it depends on where you plug the PCIe cables from the Tesla. E.g., if you have 4 slots, it may be necessary to plug them into the 1st and 3rd slots.
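On the pinned-memory point, the change is roughly the following (a minimal sketch with an arbitrary buffer size): allocate the host buffer with cudaMallocHost instead of malloc, and release it with cudaFreeHost.

#include <cuda_runtime.h>

void copy_with_pinned_host_buffer(void)
{
    size_t nbytes = 64 * 1024 * 1024;    /* arbitrary 64 MB buffer */
    void  *h_buf  = NULL;
    void  *d_buf  = NULL;

    cudaMallocHost(&h_buf, nbytes);      /* pinned (page-locked) instead of malloc() */
    cudaMalloc(&d_buf, nbytes);

    cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);                 /* pinned memory must be freed with cudaFreeHost */
}

With pageable memory the driver stages each copy through an internal pinned buffer, so a page-locked allocation removes an extra host-side memcpy and usually raises the transfer rate noticeably.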

The transfer rate you are seeing is very low for an X58 motherboard. Even without pinned memory, most people see around 6 GB/sec Host-to-Device when the CPU is using a triple channel memory configuration.

I would investigate why the transfer to a single GPU is so slow before worrying about multiple GPUs.
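A quick way to check is the bandwidthTest sample that ships with the SDK, run once with pageable and once with pinned host memory (the flags below are the ones in the SDK sample; check --help on your version):

$ ./bandwidthTest --device=0 --memory=pageable
$ ./bandwidthTest --device=0 --memory=pinned

If the pinned number is still far below what the slot should deliver, the limitation is likely in the slot or chipset rather than in the copy code.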

How is your S1070 attached to the host? With a single PCIe port split to all 4 GPUs? If that is the case, 2.5 + 2.5 + 0.76 + 0.76 ≈ 6.5 GB/s is an excellent total bandwidth for one PCIe slot.

Oh, good question! I was assuming that some kind of NF200-style PCI-Express switch was involved when claiming that a single GPU should hit > 6 GB/sec. If some other splitting scheme is used, then it could be much worse.

According to the spec, the S1070 has to be attached to the host with two cables going to interface cards plugged into two PCIe slots, with each slot handling two GPUs. Do you know otherwise?

That is the typical configuration, but a different adapter cable is available that attaches all four GPUs to a single port. We're running a number of S1070 units that way; it was recommended by an NVIDIA rep when we purchased them, to leave the 2nd PCIe slot in our machines open for InfiniBand. Since none of our applications are PCIe-bandwidth limited, I didn't feel like arguing about it (and it also lets us move the units to our other host nodes, which have only one PCIe port each).

The OP still needs to post numbers with pinned host memory; those should be much faster than the original results. Maybe tomorrow I'll pull up our S1070 nodes, run the multi-GPU bandwidth test on them, and post my numbers.

Here are the results from the concurrent bandwidth test (concBandwidthTest, linked on these forums) running on our GPUs. The host nodes are IBM nodes with Opteron 2356 CPUs.

S1070 (looks like I got one of the units with 2 PCIe cards, each connecting to 2 GPUs)

$ ./concBandwidthTest 0 1 2 3
Device 0 took 6743.025391 ms
Device 1 took 6743.016602 ms
Device 2 took 10128.827148 ms
Device 3 took 10128.791016 ms
Average HtoD bandwidth in MB/s: 3161.981079
Device 0 took 7065.671387 ms
Device 1 took 7282.900879 ms
Device 2 took 7752.532715 ms
Device 3 took 8020.559082 ms
Average DtoH bandwidth in MB/s: 3408.044678

$ ./concBandwidthTest 0
Device 0 took 3372.190674 ms
Average HtoD bandwidth in MB/s: 1897.876099
Device 0 took 2265.299072 ms
Average DtoH bandwidth in MB/s: 2825.234131

$ ./concBandwidthTest 2
Device 2 took 4265.948242 ms
Average HtoD bandwidth in MB/s: 1500.252563
Device 2 took 4011.107178 ms
Average DtoH bandwidth in MB/s: 1595.569458

S2050 (definitely attached with 1 PCIe card for all 4 gpus)

$ ./concBandwidthTest 0 1 2 3
Device 0 took 13472.929688 ms
Device 1 took 13472.834961 ms
Device 2 took 13471.897461 ms
Device 3 took 13452.796875 ms
Average HtoD bandwidth in MB/s: 1900.857056
Device 0 took 7526.878418 ms
Device 1 took 7770.839844 ms
Device 2 took 7898.551758 ms
Device 3 took 8044.792969 ms
Average DtoH bandwidth in MB/s: 3279.698669

$ ./concBandwidthTest 0
Device 0 took 3371.951660 ms
Average HtoD bandwidth in MB/s: 1898.010620
Device 0 took 2267.714844 ms
Average DtoH bandwidth in MB/s: 2822.224365

This is impossible: "S2050 (definitely attached with 1 cable for all 4 gpus)"

You can have a single DHIC, but still two cables.

OK, I'm just being imprecise with my terminology. I don't plug the hardware in myself (the sysadmins do), so I don't actually know how many cables there are. What I really mean is one host PCIe port connected to all 4 GPUs, regardless of the number of physical cables. I'll edit the post to reflect that.

IBM x3455 7940?

As far as I can tell, it has one PCIe 1.x x16 slot (theoretical peak 4 GB/s) and one PCIe 1.x x8 slot (theoretical peak 2 GB/s). And that’s what you’re seeing: the first two GPUs are connected to the x16 slot and the other two are connected to the x8 slot.
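To spell out the arithmetic behind those theoretical peaks (assuming the usual 250 MB/s of payload per lane per direction for PCIe 1.x, i.e. 2.5 GT/s with 8b/10b encoding):

#include <stdio.h>

int main(void)
{
    const double gb_per_lane = 0.25;   /* 250 MB/s per PCIe 1.x lane, per direction */
    printf("x16 slot: %.1f GB/s theoretical peak\n", 16 * gb_per_lane);  /* 4.0 GB/s */
    printf("x8 slot:  %.1f GB/s theoretical peak\n",  8 * gb_per_lane);  /* 2.0 GB/s */
    return 0;
}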