I tried running
ncu --target-processes all --replay-mode application -k regex:cross_device -o prof_report -f python
to profile an allreduce kernel, but the process hangs every time with the following output:
==PROF== Profiling "cross_device_reduce_2stage": Application replay pass 1
==WARNING== Launching the workload is taking more time than expected. If this continues to hang, terminate the profile and re-try by profiling the range of all related launches using '--replay-mode app-range'. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#replay for more details.
Any ideas on what’s happening here?