Question about profiling NCCL kernels with Nsight Compute

Hi,
I would like to profile NCCL kernels and get some detailed metrics using Nsight Compute, but the profiler always hangs. Can anybody give me some information about this? Thanks.

Adding more details:
tested on NGC container: nvcr.io/nvidia/pytorch:21.07-py3
application: nccl-tests/build/all_reduce_perf

PS: The same issue has been reported on GitHub, but there is no conclusion yet.

Nsight Compute serializes kernel launches across all profiled processes. If a kernel waits on other concurrent processes (or kernels), it cannot make forward progress and the profiling hangs. Such applications therefore cannot be profiled with Nsight Compute.
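To illustrate the problem, here is a minimal, hypothetical sketch (not NCCL code) of the kind of cross-kernel dependency that cannot survive serialized replay; the kernel and flag names are made up for illustration only:

// Hypothetical example: two kernels hand-shake through flags visible to
// both (e.g. peer-mapped or managed memory). Each kernel signals its own
// flag and then spins until the other kernel has signalled. They only
// finish if they run concurrently; if a profiler serializes them, the
// first kernel spins forever and profiling hangs.
__global__ void handshake(volatile int* my_flag, volatile int* peer_flag) {
  *my_flag = 1;                  // signal the peer
  while (*peer_flag == 0) { }    // wait for the peer's signal
}

// Conceptually launched on two devices (or two ranks/processes):
//   cudaSetDevice(0); handshake<<<1, 1>>>(flag0, flag1);
//   cudaSetDevice(1); handshake<<<1, 1>>>(flag1, flag0);
// NCCL collectives behave similarly: each rank's kernel exchanges data
// with its peers' kernels, so serialized kernel replay cannot make
// forward progress.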

Hi Sanjiv,
Thanks. Is there any plan for ncu to support NCCL kernel profiling?

Yes, we are looking into supporting these types of applications in the future, but there is no definite timeline yet for when such support will be released.

Hi there,
It seems we are hitting the same issue (Profiling all_reduce_perf with Nsight hangs · Issue #101 · NVIDIA/nccl-tests · GitHub). Are there any updates on the timeline for supporting this?

thanks!

Nsight Compute 2022.1 includes a new Range Replay feature to support profiling kernels that must run concurrently (such as NCCL all-reduce). Range replay requires you to mark explicit ranges of kernels (and CUDA API calls) for profiling, using either the cu(da)ProfilerStart/Stop API or NVTX. A single result for the entire range is then collected, with the limitation that data is only collected for kernels from the first CUDA context found within the range. Also, it only works for ranges covering a single process.

Note that NCCL all-reduce kernels are not yet fully supported with this version of range replay, meaning that profiling may still hang intermittently. It will work in many cases, though.

For the NCCL all_reduce_perf test, a possible range is in common.cu lines 621ff

// Performance Benchmark
auto start = std::chrono::high_resolution_clock::now();
cudaProfilerStart();
for (int iter = 0; iter < iters; iter++) {
  if (agg_iters>1) NCCLCHECK(ncclGroupStart());
  for (int aiter = 0; aiter < agg_iters; aiter++) {
    TESTCHECK(startColl(args, type, op, root, in_place, iter*agg_iters+aiter));
  }
  if (agg_iters>1) NCCLCHECK(ncclGroupEnd());
}
cudaProfilerStop();
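
If you prefer NVTX over the profiler start/stop API, a rough (untested) variant of the same range could look like the sketch below. It assumes the NVTX headers shipped with the CUDA toolkit and uses ncu's --nvtx/--nvtx-include options to select the range; "nccl_allreduce_range" is just a placeholder name, and the exact filter syntax should be checked against the range replay documentation.

#include <nvtx3/nvToolsExt.h>   // NVTX v3 headers (older toolkits: <nvToolsExt.h>)

// Performance Benchmark
auto start = std::chrono::high_resolution_clock::now();
nvtxRangePushA("nccl_allreduce_range");   // open the range to be profiled
for (int iter = 0; iter < iters; iter++) {
  if (agg_iters>1) NCCLCHECK(ncclGroupStart());
  for (int aiter = 0; aiter < agg_iters; aiter++) {
    TESTCHECK(startColl(args, type, op, root, in_place, iter*agg_iters+aiter));
  }
  if (agg_iters>1) NCCLCHECK(ncclGroupEnd());
}
nvtxRangePop();                           // close the range

// Invocation would then be along the lines of (filter syntax may differ):
//   ncu --replay-mode range --nvtx --nvtx-include "nccl_allreduce_range/" ./build/all_reduce_perf ...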

Hi felix_dt,

I just followed your steps:

  1. installed the latest Nsight Compute (2022.1.1)
  2. added cudaProfilerStart() and cudaProfilerStop() before and after the startColl() loop

but it still hangs. Why? Am I missing any other steps?

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2022 NVIDIA Corporation
Version 2022.1.1.0 (build 30914944) (public-release)

command: ncu --set full -f -o all_reduce_2_ranks_2M ./build/all_reduce_perf -g 2 -n 1 -w 0 -b 2M -e 2M -c 0

nThread 1 nGpus 2 minBytes 2097152 maxBytes 2097152 step: 1048576(bytes) warmup iters: 0 iters: 1 validation: 0

Using devices
==PROF== Connected to process 13632 (githubs/nccl-tests/build/all_reduce_perf)
Rank 0 Pid 13632 on da870a356542 device 0 [0x54] NVIDIA A100-SXM-80GB
Rank 1 Pid 13632 on da870a356542 device 1 [0x5a] NVIDIA A100-SXM-80GB

                                                   out-of-place                       in-place
   size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
    (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)

==PROF== Profiling “ncclKernel_AllReduce_RING_LL_…” - 1:

BTW, I was testing on nvcr.io/nvidia/pytorch:22.03-py3, with the latest ncu and NCCL 2.12.

And after testing all the collectives, I found that only broadcast_perf could be profiled. Is that right?

Thanks

below is the code diff:

diff --git a/src/common.cu b/src/common.cu
index 05f814d..a6e7f58 100644
--- a/src/common.cu
+++ b/src/common.cu
@@ -10,6 +10,7 @@
 #include <getopt.h>
 #include <libgen.h>
 #include "cuda.h"
+#include <cuda_profiler_api.h>

 int test_ncclVersion = 0; // init'd with ncclGetVersion()

@@ -596,8 +597,8 @@ testResult_t BenchTime(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
 }

 // Sync
-  TESTCHECK(startColl(args, type, op, root, in_place, 0));
-  TESTCHECK(completeColl(args));
+  // TESTCHECK(startColl(args, type, op, root, in_place, 0));
+  // TESTCHECK(completeColl(args));

 Barrier(args);

@@ -617,6 +618,8 @@ testResult_t BenchTime(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t

 // Performance Benchmark
 auto start = std::chrono::high_resolution_clock::now();
+  PRINT("cudaProfilerStart\n");
+  cudaProfilerStart();
 for (int iter = 0; iter < iters; iter++) {
   if (agg_iters>1) NCCLCHECK(ncclGroupStart());
   for (int aiter = 0; aiter < agg_iters; aiter++) {
@@ -624,6 +627,8 @@ testResult_t BenchTime(struct threadArgs* args, ncclDataType_t type, ncclRedOp_t
 }
 if (agg_iters>1) NCCLCHECK(ncclGroupEnd());
 }
+  cudaProfilerStop();
+  PRINT("cudaProfilerStop\n");

Hi!

You need to explicitly specify the range replay mode when running the profiler.
For example:
ncu --replay-mode range ./build/all_reduce_perf -g 2 -n 1 -w 0 -b 2M -e 2M -c 0

At least it worked for me

But I also need to profile an application with network communication. I was trying to run all_reduce_perf compiled with MPI=1 across 2 MPI processes:

mpirun -n 2 ncu --target-processes all --replay-mode range ./all_reduce_perf -g 3 -n 1 -c 0

This is where the cross-process dependency (and the hang) shows up:

#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
==PROF== Profiling "range" - 1: ==PROF== Profiling "range" - 1:

Can I somehow profile NCCL kernels when using MPI?


Hi,

Thank you so much. It works after adding the '--replay-mode range' option.

As for MPI profiling, I haven't tried it before.

Nsight Compute range replay across different processes is not yet supported, unfortunately. We will be looking into this for a future release. In the meantime, you could consider Nsight Systems' GPU metric sampling functionality to get some limited metric values sampled over time.

Hi felix,

With Nsight Compute 2022.1, I can now profile NCCL kernels with the range replay feature.

However, I found that there are no metrics about peer memory in the "Memory Workload" chart.

Could you help confirm this? Am I missing any options, or does the tool just not support it yet?

Thanks