Caffe + NCCL: Check failed: result == ncclSuccess (13 vs. 0) invalid data type

I am using caffe-master from GitHub with 4 x Tesla M40, Ubuntu 16.04, CUDA 9.1, cuDNN v7, and NCCL 2.1.15.

I am trying to train an ordinary image classification network on multiple GPUs, but I get the following error right at the start of training:

I0501 03:45:00.237848 55609 net.cpp:255] Network initialization done.
I0501 03:45:00.238184 55609 solver.cpp:57] Solver scaffolding done.
I0501 03:45:00.244508 55609 caffe.cpp:239] Starting Optimization
I0501 03:45:01.672870 55669 solver.cpp:190] Creating test net (#0) specified by net file: ./ResNet_18_train_val.prototxt
F0501 03:45:02.284466 55609 parallel.cpp:195] Check failed: result == ncclSuccess (13 vs. 0) invalid data type
*** Check failure stack trace: ***
F0501 03:45:02.284466 55669 parallel.cpp:195] Check failed: result == ncclSuccess (13 vs. 0) invalid data type
*** Check failure stack trace: ***
@ 0x7f38606845cd google::LogMessage::Fail()
@ 0x7f38606845cd google::LogMessage::Fail()
@ 0x7f3860686433 google::LogMessage::SendToLog()
@ 0x7f386068415b google::LogMessage::Flush()
@ 0x7f3860686433 google::LogMessage::SendToLog()
@ 0x7f3860686e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f386068415b google::LogMessage::Flush()
@ 0x7f3860e73eca caffe::NCCL<>::Broadcast()
@ 0x7f3860686e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7f3860e771bf caffe::NCCL<>::Run()
@ 0x40d84f train()
@ 0x40a497 main
@ 0x7f3860e73eca caffe::NCCL<>::Broadcast()
@ 0x7f385f5f4830 __libc_start_main
@ 0x40ae39 _start
@ (nil) (unknown)
Aborted (core dumped)

I have tested this many times, and the same network trains successfully on a single GPU.