Problems with Pinned Memory on a multi-S1070 system

I have developed some MPI/CUDA code that runs fine across the 4 devices in a single S1070. The code uses cudaMallocHost to allocate pinned memory on the host. When the code was moved to a multi-S1070 system it still ran fine on one S1070, but when I try to run it across two S1070s the cudaMemcpyAsync calls hang. The program does not crash; it simply hangs. When I do not use pinned memory the program runs fine. All calls to cudaMemcpyAsync use stream 0.
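
Roughly, the pattern in the code looks like this (a simplified sketch; the buffer size, names, and the one-process-per-GPU mapping are placeholders for illustration, not the actual code):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    cudaSetDevice(rank % 4);                  /* one MPI process per GPU, 4 GPUs per S1070 */

    size_t nbytes = 64 * 1024 * 1024;         /* placeholder buffer size */
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, nbytes);  /* pinned host memory */
    cudaMalloc((void **)&d_buf, nbytes);

    /* asynchronous copy on stream 0 -- this is the call that hangs across two S1070s */
    cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);

    /* ... kernel launches and MPI exchanges elided ... */

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    MPI_Finalize();
    return 0;
}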

After some research on these forums, the problem appeared to be that I was not allocating the pinned memory with cudaHostAlloc and the cudaHostAllocPortable flag. I tried that, but the code still hangs in the same way.
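
Concretely, the only change was the allocation call (same placeholder names as in the sketch above):

/* pinned memory that is portable across CUDA contexts, instead of cudaMallocHost */
cudaHostAlloc((void **)&h_buf, nbytes, cudaHostAllocPortable);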

Does anyone have any suggestions as to what the problem is, and/or how to rectify it?

Thanks in advance.

Are these two S1070s connected to the same host machine? Eight pinned memory allocations on a single machine sounds like rather a lot. If the total amount of memory involved is large, it might just be the virtual memory system thrashing.

It looks like a problem with RDMA over InfiniBand.
To see whether this is the cause, try running over Ethernet (with Open MPI: --mca btl ^openib).

To solve the problem, you will need to disable RDMA for InfiniBand (the flag for Open MPI is "--mca btl_openib_flags 1"; you can pass it as an argument to mpirun), or try GPUDirect.
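
For example, the command lines would look something like this (the binary name ./my_app and the process count are placeholders; adjust them to your launcher and hostfile):

# run over Ethernet/TCP only, bypassing the openib BTL, to confirm the diagnosis
mpirun --mca btl ^openib -np 8 ./my_app

# keep InfiniBand but restrict the openib BTL to send/receive, i.e. no RDMA
mpirun --mca btl_openib_flags 1 -np 8 ./my_app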

I haven’t tried it myself yet, but I’ve been told by the system admin that disabling RDMA as you suggest works.

Thanks!