Transfer rates in Multi-GPU

Hi All,
I have an i7 980X CPU connected to a Tesla S1070 (which has 4 GPUs). I'm transferring data from the CPU to each GPU at the same time and trying to measure the bandwidth I can get between the CPU and each individual GPU.

The transfer rates obtained between the CPU and the 4 GPUs are 0.772456, 0.764574, 2.54562 and 2.5455 GB/s.

But when I transfer data from the CPU to just one GPU, the transfer rate obtained is 1.56321 GB/s.

I see that

  1. when I transfer data from the CPU to all GPUs at the same time, the aggregate transfer rate is almost 4 * (the transfer rate between the CPU and one GPU): 0.77 + 0.76 + 2.55 + 2.55 ≈ 6.6 GB/s, versus 1.56 GB/s for a single transfer.
  2. the transfer rate between the CPU and an individual GPU during a simultaneous transfer to all GPUs can be higher than the transfer rate from the CPU to a single GPU (2.55 GB/s versus 1.56 GB/s).

Are my observations correct…?

For the above experiment, I proceeded as follows.

I associated one CPU thread with each GPU (Tesla C1060), and the pseudocode for the thread is shown below.

pthread_barrier_wait(&barrier);   // to start all the transfers at the same time
CUT_SAFE_CALL(cutStartTimer(timer));
cudaMemcpy(...);                  // host-to-device copy
CUT_SAFE_CALL(cutStopTimer(timer));
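Spelled out in full, each worker thread does roughly the following. This is only a sketch: the 64 MB buffer size and the variable names are illustrative, and it times the copy with CUDA events instead of the cutil timer I actually used.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define NGPUS  4
#define NBYTES (64 * 1024 * 1024)        /* 64 MB per transfer (illustrative) */

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);                  /* bind this host thread to one GPU */

    void *h_buf = malloc(NBYTES);        /* pageable host memory, as in my runs */
    void *d_buf = NULL;
    cudaMalloc(&d_buf, NBYTES);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    pthread_barrier_wait(&barrier);      /* start all four copies at the same time */

    cudaEventRecord(start, 0);
    cudaMemcpy(d_buf, h_buf, NBYTES, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU %d: %.3f GB/s\n", dev, (NBYTES / 1e9) / (ms / 1e3));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    free(h_buf);
    return NULL;
}

int main(void)
{
    pthread_t tid[NGPUS];
    int       id[NGPUS];

    pthread_barrier_init(&barrier, NULL, NGPUS);
    for (int i = 0; i < NGPUS; ++i) {
        id[i] = i;
        pthread_create(&tid[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NGPUS; ++i)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}

Bandwidth is then just the number of bytes copied divided by the elapsed time.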

Regards,
M. Kiran Kumar.

Make sure that you’re using pinned memory.

  1. All transfer rates you’re quoting are quite low.

  2. Generally speaking, what matters here is not so much the CPU as the motherboard and memory.

You must have an X58-based motherboard. X58 can, in theory, sustain two uploads to two different GPUs in parallel. The caveat is that, on some motherboards, it depends on where you plug the PCIe cables from the Tesla. E.g., if you have 4 slots, it may be necessary to plug them into the 1st and 3rd slots.
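On the pinned-memory point, the change is roughly the following (a minimal sketch with an arbitrary buffer size): allocate the host buffer with cudaMallocHost instead of malloc, and release it with cudaFreeHost.

#include <cuda_runtime.h>

void copy_with_pinned_host_buffer(void)
{
    size_t nbytes = 64 * 1024 * 1024;    /* arbitrary 64 MB buffer */
    void  *h_buf  = NULL;
    void  *d_buf  = NULL;

    cudaMallocHost(&h_buf, nbytes);      /* pinned (page-locked) instead of malloc() */
    cudaMalloc(&d_buf, nbytes);

    cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);                 /* pinned memory must be freed with cudaFreeHost */
}

With pageable memory the driver stages each copy through an internal pinned buffer, so a page-locked allocation removes an extra host-side memcpy and usually raises the transfer rate noticeably.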

The transfer rate you are seeing is very low for an X58 motherboard. Even without pinned memory, most people see around 6 GB/sec Host-to-Device when the CPU is using a triple channel memory configuration.

I would investigate why the transfer to a single GPU is so slow before worrying about multiple GPUs.
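A quick way to check is the bandwidthTest sample that ships with the SDK, run once with pageable and once with pinned host memory (the flags below are the ones in the SDK sample; check --help on your version):

$ ./bandwidthTest --device=0 --memory=pageable
$ ./bandwidthTest --device=0 --memory=pinned

If the pinned number is still far below what the slot should deliver, the limitation is likely in the slot or chipset rather than in the copy code.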

How is your S1070 attached to the host? With a single PCIe port split to all 4 GPUs? If that is the case, 2.5 + 2.5 + 0.76 + 0.76 ≈ 6.5 GB/s is an excellent total bandwidth for one PCIe slot.

Oh, good question! I was assuming that some kind of NF200-style PCI-Express switch was involved when claiming that a single GPU should hit > 6 GB/sec. If some other splitting scheme is used, then it could be much worse.

According to the spec, the S1070 has to be attached to the host with two cables going to interface cards plugged into two PCIe slots, with each slot handling two GPUs. Do you know otherwise?

That is the typical configuration, but a different adapter cable is available that attaches all four GPUs to a single port. We're running a number of S1070 units that way; it was recommended by an NVIDIA rep when we purchased them, to leave the 2nd PCIe slot in our machines open for InfiniBand. Since none of our applications are PCIe-bandwidth limited, I didn't feel like arguing about it (and it also lets us move the units to our other host nodes, which have only one PCIe port each).

The OP still needs to post numbers with pinned host memory; those should be much faster than the original results. Maybe tomorrow I'll pull up our S1070 nodes, run the multi-GPU bandwidth test on them, and post my numbers.

Here are the results from the concurrent bandwidth test (concBandwidthTest, linked on these forums) running on our GPUs. The host nodes are IBM nodes with Opteron 2356 CPUs.

S1070 (looks like I got one of the units with 2 PCIe cards, each connecting to 2 GPUs)

$ ./concBandwidthTest 0 1 2 3
Device 0 took 6743.025391 ms
Device 1 took 6743.016602 ms
Device 2 took 10128.827148 ms
Device 3 took 10128.791016 ms
Average HtoD bandwidth in MB/s: 3161.981079
Device 0 took 7065.671387 ms
Device 1 took 7282.900879 ms
Device 2 took 7752.532715 ms
Device 3 took 8020.559082 ms
Average DtoH bandwidth in MB/s: 3408.044678

$ ./concBandwidthTest 0
Device 0 took 3372.190674 ms
Average HtoD bandwidth in MB/s: 1897.876099
Device 0 took 2265.299072 ms
Average DtoH bandwidth in MB/s: 2825.234131

$ ./concBandwidthTest 2
Device 2 took 4265.948242 ms
Average HtoD bandwidth in MB/s: 1500.252563
Device 2 took 4011.107178 ms
Average DtoH bandwidth in MB/s: 1595.569458

S2050 (definitely attached with 1 PCIe card for all 4 gpus)

$ ./concBandwidthTest 0 1 2 3
Device 0 took 13472.929688 ms
Device 1 took 13472.834961 ms
Device 2 took 13471.897461 ms
Device 3 took 13452.796875 ms
Average HtoD bandwidth in MB/s: 1900.857056
Device 0 took 7526.878418 ms
Device 1 took 7770.839844 ms
Device 2 took 7898.551758 ms
Device 3 took 8044.792969 ms
Average DtoH bandwidth in MB/s: 3279.698669

$ ./concBandwidthTest 0
Device 0 took 3371.951660 ms
Average HtoD bandwidth in MB/s: 1898.010620
Device 0 took 2267.714844 ms
Average DtoH bandwidth in MB/s: 2822.224365

This is impossible: "S2050 (definitely attached with 1 cable for all 4 gpus)"

You can have a single DHIC, but still two cables.

OK, I'm just being imprecise with my terminology. I don't plug the hardware in myself (the sysadmins do), so I don't actually know how many cables there are. What I really mean is one host PCIe port connected to all 4 GPUs, regardless of the number of physical cables. I'll edit the post to reflect that.

IBM x3455 7940?

As far as I can tell, it has one PCIe 1.x x16 slot (theoretical peak 4 GB/s) and one PCIe 1.x x8 slot (theoretical peak 2 GB/s). And that’s what you’re seeing: the first two GPUs are connected to the x16 slot and the other two are connected to the x8 slot.
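To spell out the arithmetic behind those theoretical peaks (assuming the usual 250 MB/s of payload per lane per direction for PCIe 1.x, i.e. 2.5 GT/s with 8b/10b encoding):

#include <stdio.h>

int main(void)
{
    const double gb_per_lane = 0.25;   /* 250 MB/s per PCIe 1.x lane, per direction */
    printf("x16 slot: %.1f GB/s theoretical peak\n", 16 * gb_per_lane);  /* 4.0 GB/s */
    printf("x8 slot:  %.1f GB/s theoretical peak\n",  8 * gb_per_lane);  /* 2.0 GB/s */
    return 0;
}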