Hi,
I would like to profile NCCL kernels and get some detailed metrics using Nsight Compute, but it always hangs. Can anybody give me some information about this? Thanks.
Nsight Compute serializes kernel launches across all profiled processes. If a kernel waits for other concurrent processes (or kernels), it will not be able to make forward progress and the profiling will hang. So such applications cannot be profiled using Nsight Compute.
Nsight Compute 2022.1 includes a new Range Replay feature to support profiling mandatory concurrent kernels (such as NCCL all-reduce). Range replay requires you to mark explicit ranges of kernels (and CUDA API calls) for profiling, using either the cu(da)ProfilerStart/Stop API or NVTX. A single result is then collected for the entire range, with the limitation that data is only collected for kernels from the first CUDA context found within the range. Also, it only works for ranges covering a single process.
Note that NCCL all-reduce kernels are not yet fully supported with this version of range replay, meaning profiling may still hang intermittently. Still, it will work in many cases.
For the NCCL all_reduce_perf test, a possible range is in common.cu, lines 621ff:
// Performance Benchmark
auto start = std::chrono::high_resolution_clock::now();
cudaProfilerStart();
for (int iter = 0; iter < iters; iter++) {
  if (agg_iters > 1) NCCLCHECK(ncclGroupStart());
  for (int aiter = 0; aiter < agg_iters; aiter++) {
    TESTCHECK(startColl(args, type, op, root, in_place, iter*agg_iters+aiter));
  }
  if (agg_iters > 1) NCCLCHECK(ncclGroupEnd());
}
cudaProfilerStop();
You should explicitly specify the range replay mode when running the profiler.
For example: ncu --replay-mode range ./build/all_reduce_perf -g 2 -n 1 -w 0 -b 2M -e 2M -c 0
At least it worked for me
But I needed to profile an application with network communication. I was trying to run all_reduce_perf compiled with MPI=1 across 2 MPI processes:
mpirun -n 2 ncu --target-processes all --replay-mode range ./all_reduce_perf -g 3 -n 1 -c 0
Nsight Compute range replay across different processes is not yet supported, unfortunately. We will be looking into this for a future release. You could consider Nsight Systems' GPU metrics sampling functionality to get some limited metric values sampled over time.
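For instance, something along these lines (an untested sketch; the --gpu-metrics-device option requires a reasonably recent Nsight Systems version, and the report name is a placeholder):
nsys profile --gpu-metrics-device=all -o all_reduce_report mpirun -n 2 ./all_reduce_perf -g 3 -n 1 -c 0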
Nsight Compute range replay across different processes is now supported using the new app-range replay mode, starting from Nsight Compute version 2023.1 (CUDA 12.1). The app-range replay mode profiles ranges without API capture by relaunching the entire application multiple times. After setting an appropriate range (using the profiler start/stop API or NVTX ranges), applications using NCCL can now be profiled with --replay-mode app-range.
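For example, the earlier MPI launch could become (an untested sketch, reusing the command line from above with only the replay mode changed):
mpirun -n 2 ncu --target-processes all --replay-mode app-range ./all_reduce_perf -g 3 -n 1 -c 0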
Hi,
I would like to profile NCCL kernels in PyTorch code using Nsight Compute. I am able to profile nccl-test/build/all_reduce_perf using --replay-mode app-range, but for my PyTorch code it always hangs.
My code is the following:
"""run.py:"""
#!/usr/bin/env python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
From this snippet, it looks like you are attempting to profile all of the ranks, each of which has its own range, as they are launched in parallel; that could cause the hang. Can you try calling torch.cuda.cudart().cudaProfilerStart()/Stop() for only a single rank, with something like the sketch below?
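A minimal sketch, assuming the standard torch.distributed multiprocessing setup from the PyTorch tutorials (the run function, tensor size, and rendezvous settings are placeholders, not the original code):
#!/usr/bin/env python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    # Placeholder rendezvous settings for a single-node run.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    tensor = torch.ones(1 << 20, device="cuda")

    # Only rank 0 opens the profiler range; the other ranks run normally,
    # so the profiled range does not wait on a second profiled process.
    if rank == 0:
        torch.cuda.cudart().cudaProfilerStart()
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # NCCL all-reduce kernel
    torch.cuda.synchronize()
    if rank == 0:
        torch.cuda.cudart().cudaProfilerStop()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)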
Hi,
I have questions about the profiling results of PyTorch distributed data-parallel training with 2 GPUs (connected with NVLink), which I got by running ncu --target-processes all --replay-mode app-range --set nvlink --metrics nvlrx__bytes.sum,nvltx__bytes.sum,nvlrx__bytes_data_user.sum,nvltx__bytes_data_user.sum,pcie__read_bytes.sum,pcie__write_bytes.sum,nvltx__bytes.sum.per_second,nvlrx__bytes.sum.per_second python main_nsysprofiler.py -a resnext101_32x8d --dist-url 'tcp://127.0.0.1:13421' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 -b 128 --epochs 1 /data/datasets/imagenet
The results are as follows; I ran it twice to get the results for each GPU.
I would like to know how nvltx/nvlrx__bytes.sum.per_second is calculated. I know the numerator is nvlrx/nvltx__bytes.sum, but I don't know the denominator. Is it the time spent executing the range? Or is it the time the data spends being transferred on the NVLink fabric? How is it related to the nccl_allreduce kernel time? I know I can get the kernel execution time from Nsight Systems, but I am not sure how to get it using Nsight Compute.
The bandwidth/utilization is very low in my results. Is this a normal value to expect? Hope someone can shed some light on it. Thanks!
The denominator is the wall-clock time it took to execute the range, from the start marker to the end marker. This value is collected with the metric gpu__time_duration.sum. Because you need to use ranges, Nsight Compute doesn't have metrics for individual kernels. In this case, the best way to get the nccl_allreduce kernel time is probably from Nsight Systems. Or, if your range happens to contain only that kernel, the range time in Nsight Compute may be close. With respect to the bandwidth utilization, that's a difficult question to answer and depends on the application.
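In other words, for a profiled range the rate metric works out to roughly:
nvlrx__bytes.sum.per_second = nvlrx__bytes.sum / gpu__time_duration.sum
(and likewise for nvltx), where gpu__time_duration.sum is the wall-clock duration of the whole range, not of any individual kernel.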
Hi,
Thanks for your reply. That makes sense. I will use the nccl_allreduce kernel time from Nsight Systems. I have one more question related to the metrics in the nvlink set. Why are there received/transmitted overhead bytes? Is that related to the algorithm used by the NCCL library? It seems the overhead is around 80% of the useful data.
The overhead bytes are protocol overhead for using NVLink, and are not specific to NCCL. It's hard to say why the ratio is what it is; perhaps the algorithm is only sending small amounts of data per transmission. You may need to talk with the NCCL team or dig deeper into the performance analysis.