How to check what is slowing down the NCCL allreduce kernel

Question

I want to analyze the speed of distributed data parallel (DDP) training of AlexNet, but I find that the allreduce invoked during AlexNet training is slower than the same allreduce invoked directly via nccl-tests or a standalone PyTorch allreduce test.

Environment

  1. PyTorch 1.5 + CUDA 9.0 + NCCL 2.7.8
  2. 4x GTX 1080 Ti

Details of Performance

AlexNet has about 244 MB of parameters (roughly 61M float32 values), so I test the allreduce performance for this message size with 4 devices.
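
For reference, the 244 MB figure can be checked directly from the model (a minimal sketch, assuming torchvision's alexnet and float32 gradients):

import torchvision

# Count AlexNet's parameters and the corresponding float32 gradient size.
model = torchvision.models.alexnet()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} parameters, ~{n_params * 4 / 1e6:.0f} MB of float32 gradients")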

Speed of NCCL allreduce measured directly

I use nccl-tests to measure allreduce performance; the time is about 50.5 ms.
I also test it with the following PyTorch code; the time is about 50.4 ms.

import time

import torch
import torch.distributed as dist


def all_reduce_latency(nbytes):
    # Assumes the NCCL process group has already been initialized.
    buf = torch.randn(nbytes // 4).cuda()  # float32 -> 4 bytes per element

    # Warm-up iterations so one-time setup costs are not measured.
    for _ in range(5):
        dist.all_reduce(buf)
    torch.cuda.synchronize()

    begin = time.perf_counter()
    for _ in range(25):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    end = time.perf_counter()

    # Average latency per all_reduce call, in microseconds.
    avg_latency_us = (end - begin) * 1e6 / 25
    return avg_latency_us
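
I call it like this (a usage sketch; it assumes one process per GPU and that the launcher, e.g. python -m torch.distributed.launch --use_env, sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK in the environment):

import os

import torch
import torch.distributed as dist

if __name__ == "__main__":
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # ~244 MB message, matching AlexNet's float32 gradients.
    latency_us = all_reduce_latency(244 * 1000 * 1000)
    if dist.get_rank() == 0:
        print(f"avg all_reduce latency: {latency_us / 1000:.1f} ms")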

Speed of NCCL allreduce during AlexNet training

I use DDP with a very large bucket size (bucket_cap_mb) so that all gradients are fused into a single bucket and gradient communication is not overlapped with computation. In this setting the allreduce takes about 67.2 ms, which is much slower.
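
Roughly, the DDP setup looks like this (just a sketch, not my exact training script; local_rank and the bucket_cap_mb value of 1024 are placeholders, chosen only so a single bucket covers all 244 MB of gradients):

import torch
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

# One oversized bucket so all gradients end up in a single NCCL allreduce
# launched at the end of the backward pass (no compute/communication overlap).
local_rank = 0  # placeholder; in practice this comes from the launcher
model = torchvision.models.alexnet().cuda(local_rank)
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=1024)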

I also tried manually fusing the gradients into one buffer and calling allreduce myself instead of using DDP, and it made no difference: NCCL allreduce is much slower when called during AlexNet training. I don't know what causes this. Can you give some suggestions on how to track down this problem?
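
For reference, the manual fusion I tried looks roughly like this (a sketch; fused_all_reduce is a name I use here for illustration, called once per iteration after backward()):

import torch
import torch.distributed as dist

def fused_all_reduce(params):
    # Copy every gradient into one contiguous buffer, allreduce it with a
    # single NCCL call, then scatter the averaged values back into .grad.
    grads = [p.grad for p in params if p.grad is not None]
    flat = torch.cat([g.view(-1) for g in grads])
    dist.all_reduce(flat)
    flat /= dist.get_world_size()
    offset = 0
    for g in grads:
        g.copy_(flat[offset:offset + g.numel()].view_as(g))
        offset += g.numel()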

Thanks a lot!