I have developed some MPI/CUDA code that runs fine across the 4 devices in one S1070. The code uses cudaMallocHost to allocate pinned memory on the host. When I moved the code to a multi-S1070 system, it still ran fine on a single S1070, but when I run it across two S1070s it hangs in cudaMemcpyAsync. It does not crash; it just hangs. If I do not use pinned memory, the program runs OK. All calls to cudaMemcpyAsync use stream 0.
After some research on these forums, my problem seemed to be caused by not allocating the pinned memory with cudaHostAlloc and the cudaHostAllocPortable flag. I tried this, but the code still hangs in the same way.
Does anyone have any suggestions as to what the problem is, and/or how to rectify it?
Thanks in advance.