I tried to use NCCL to train a network.
It works well on one machine.
But it runs into an issue on two machines with 8 Tesla P40s each.
The error message looks like this:
transport/p2p.cu :515 WARN failed to open CUDA IPC handle : 30 unknown error
‘unhandled cuda error’
However, if I disable P2P and SHM, it works, though performance drops a lot.
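For context, here is a minimal sketch of how the two transports can be turned off, assuming the standard NCCL environment variables NCCL_P2P_DISABLE and NCCL_SHM_DISABLE are used. The helper names and the training command are hypothetical placeholders. NCCL_DEBUG=INFO is added only so that the log shows which transports NCCL selects.

```python
import os
import subprocess
import sys


def nccl_env_without_p2p_shm(base_env=None):
    """Return a copy of the environment with NCCL P2P and SHM disabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_P2P_DISABLE"] = "1"  # do not use the peer-to-peer (CUDA IPC) transport
    env["NCCL_SHM_DISABLE"] = "1"  # do not use the shared-memory transport
    env["NCCL_DEBUG"] = "INFO"     # make NCCL log its transport choices
    return env


def run_training(cmd):
    """Run a training command (placeholder) under the modified environment."""
    result = subprocess.run(cmd, env=nccl_env_without_p2p_shm(), check=False)
    return result.returncode


# Hypothetical usage, where train.py stands in for the real training script:
#   run_training([sys.executable, "train.py"])
```

Setting only one of the two variables at a time may help narrow down which transport triggers the IPC error.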
Can anyone help me fix this problem?