Gaps get bigger and computation gets slower when overlapped with NCCL communication

Hello NVIDIA engineers, I am writing a distributed ResNet-50 training program, but I cannot get linear speedup on 2 GPUs. So I used nvprof to dump the timeline and found some strange behavior:


The gaps in the timeline get bigger and the computation kernels run slower when they overlap with NCCL communication.
Because of these gaps, my distributed training program only achieves about a 1.8x speedup on 2 GPUs.
The NCCL version is 2.6 and the CUDA version is 10.2.
How can I improve this? Am I missing some configuration?
I found a similar issue on GitHub, but no one has replied to it yet:
https://github.com/NVIDIA/nccl/issues/357
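
For reference, here is a minimal sketch of the kind of program I am running. It assumes PyTorch DistributedDataParallel on the NCCL backend with one process per GPU and a synthetic batch; my real program follows this general pattern but is larger, and the file name train.py below is just a placeholder.

import os
import torch
import torch.distributed as dist
import torchvision

def main():
    # One process per GPU; torch.distributed.launch --use_env sets LOCAL_RANK.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = torchvision.models.resnet50().cuda(local_rank)
    # DDP overlaps the gradient all-reduce (NCCL) with the backward pass.
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    criterion = torch.nn.CrossEntropyLoss().cuda(local_rank)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Synthetic batch, just to profile the compute/communication overlap.
    images = torch.randn(64, 3, 224, 224, device=local_rank)
    labels = torch.randint(0, 1000, (64,), device=local_rank)

    for _ in range(100):
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()   # NCCL all-reduce kernels run here, overlapped with compute
        optimizer.step()

if __name__ == "__main__":
    main()

I collect one timeline per process with a command like "nvprof --profile-child-processes -o timeline_%p.nvvp python -m torch.distributed.launch --use_env --nproc_per_node=2 train.py" and then open the resulting .nvvp files in the Visual Profiler, which is where I see the gaps described above.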