As a follow-up to

What about MVAPICH2 1.8a2? Does this line in the changelog mean that P2P should now work between ranks?

- Efficient GPU-GPU transfers within a node using CUDA IPC (for CUDA 4.1)

Because I sure can’t seem to get it to work.

I’ve got some Dell 6100 hosts with M2070s in attached C410x boxes, connected in an 8:1 configuration. They don’t have any IB cards.

If you simply have to ask why: we run single-GPU jobs 99.99% of the time here. The 8:1 configuration does not slow these runs at all, and it saves us from spending money on lots of unused host nodes. However, the combination of P2P transfers and the ease of use of device pointers in MPI calls seems like a perfect combo for us to try some multi-GPU code.

All I’m able to get is about 500 MB/s in the “osu_bw D D” test:

CMA: no RDMA devices found
CMA: no RDMA devices found
# OSU MPI-CUDA Bandwidth Test v3.5.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size        Bandwidth (MB/s)
1                         0.01
2                         0.02
4                         0.05
8                         0.09
16                        0.18
32                        0.36
64                        0.73
128                       1.46
256                       2.92
512                       5.82
1024                     11.63
2048                     23.08
4096                     45.74
8192                     89.64
16384                   172.07
32768                   315.15
65536                   549.34
131072                  525.78
262144                  516.37
524288                  513.17
1048576                 516.52
2097152                 517.11
4194304                 518.28
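
For anyone comparing runs across MVAPICH2 versions, here is a minimal sketch of how the osu_bw output above could be reduced to a single peak figure. The function names (`parse_osu_bw`, `peak`) are my own and not part of the OSU benchmarks, and the sample is only an excerpt of the rows quoted above, so treat it as an illustration rather than a supported tool.

```python
import re

# Excerpt of the osu_bw output quoted above (message size in bytes, MB/s).
SAMPLE = """\
CMA: no RDMA devices found
# OSU MPI-CUDA Bandwidth Test v3.5.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size        Bandwidth (MB/s)
32768                   315.15
65536                   549.34
131072                  525.78
4194304                 518.28
"""

# A data row is two numeric columns: an integer size and a decimal bandwidth.
ROW = re.compile(r"^\s*(\d+)\s+(\d+(?:\.\d+)?)\s*$")


def parse_osu_bw(text):
    """Return a list of (message_size_bytes, bandwidth_mb_s) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # skips '#' headers, blank lines, and "CMA: ..." warnings
            rows.append((int(match.group(1)), float(match.group(2))))
    return rows


def peak(rows):
    """Return the (size, bandwidth) row with the highest bandwidth."""
    return max(rows, key=lambda row: row[1])


size, bw = peak(parse_osu_bw(SAMPLE))
print(f"peak {bw:.2f} MB/s at {size} bytes")
```

On the excerpt this reports the 65536-byte row, which matches what the full table shows: bandwidth tops out around 550 MB/s and then plateaus near 517 MB/s for larger messages.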

I’ve got the following env vars set:

setenv MV2_USE_CUDA 1

setenv MV2_USE_SHARED_MEM 1

and no amount of fiddling with MV2_CUDA_IPC* variables makes a difference.

Surely the internal switches in the C410x are capable of more than 500 MB/s?

Can you please try the 1.8RC1 version of MVAPICH2? It has improved CUDA IPC-based designs for nodes with multiple GPUs. Let us know if you see any performance issues.

Sreeram Potluri

Yep, with 1.8RC1, everything is working great. I get >6 GB/s in the bandwidth benchmark, and decent performance in my application.