Question about NCCL trace in Nsight Systems

Hi all,
I’m using Nsight Systems to trace NCCL behavior. However, in my timeline I can only see the NCCL kernel launches, the NCCL kernels, and some NCCL NVTX ranges. What I really want to know is whether any memory copies between host and device occur while an NCCL kernel is executing. From this discussion I learned that even if memory is copied inside the kernel, I won’t see a memcpy API call or any memcpy activity in any CUDA stream. Is that correct?
In general, is the reason I can’t see any NCCL-related memcpy in the Nsight Systems timeline that NCCL never uses explicit memcpy, or is it just that nsys can’t capture it?
Regards,
Liu

What version of Nsys are you running?

I’m also going to refer this question to @rdietrich, who is our NCCL expert.

That is correct: Nsight Systems cannot capture the communication activity triggered by NCCL kernels. The reason is that NCCL doesn’t use CUDA memcpy APIs in its collectives (or in send/receive). Instead, NCCL kernels use load/store operations directly, so there is no memcpy call or memcpy operation for the tool to show.

You might want to collect GPU metrics (--gpu-metrics-device option) to see the PCIe or NVLink traffic.
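
For anyone who wants to try this, here is a minimal sketch of a command line that combines CUDA/NVTX tracing with GPU metrics sampling. It only assembles and prints the nsys invocation. The helper name `build_nsys_command`, the report name `nccl_trace`, and the application path `./my_nccl_app` are placeholders I made up for illustration; adjust the device selection to your system, and note that GPU metrics sampling may not be available on every GPU or without sufficient permissions.

```python
import shlex


def build_nsys_command(app_argv, output="nccl_trace", gpu_devices="all"):
    """Assemble an nsys command that traces CUDA/NVTX and samples GPU metrics.

    app_argv    -- the application and its arguments, as a list of strings
    output      -- report name passed to -o (placeholder default)
    gpu_devices -- value for --gpu-metrics-device, e.g. "all" or "0"
    """
    return [
        "nsys", "profile",
        "--trace=cuda,nvtx",                      # kernel launches and NVTX ranges
        f"--gpu-metrics-device={gpu_devices}",    # sample GPU metrics (PCIe/NVLink traffic)
        "-o", output,
        *app_argv,
    ]


if __name__ == "__main__":
    # "./my_nccl_app" is a hypothetical binary name, not something from this thread.
    cmd = build_nsys_command(["./my_nccl_app"])
    # Print a copy-pasteable shell command instead of running it.
    print(shlex.join(cmd))
```

In the resulting report, the GPU metrics rows can be lined up against the NCCL kernel ranges in the timeline to see whether PCIe or NVLink traffic occurs while those kernels are running.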