NCU hangs when trying to profile a multi-GPU kernel

I tried running
ncu --target-processes all --replay-mode application -k regex:cross_device -o prof_report -f python
to profile an allreduce kernel, but the process keeps hanging:

==PROF== Profiling "cross_device_reduce_2stage": Application replay pass 1
==WARNING== Launching the workload is taking more time than expected. If this continues to hang, terminate the profile and re-try by profiling the range of all related launches using '--replay-mode app-range'. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#replay for more details.

Any ideas on what’s happening here?

Hi, @szymon.ozog

The warning already suggests a solution and links to the related doc. Have you tried it?

Sadly the kernel still hung when running with app-range. The workaround I managed to get working is profiling the kernels on one GPU at a time:

# x is the single rank to profile; every other rank runs unprofiled
if rank == x:
    cuProfilerStart()
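
A slightly fuller sketch of that workaround might look like the following. It is only an illustration, not the exact code used here: the helper name profile_on_rank is hypothetical, and it assumes PyTorch is installed so the CUDA runtime calls cudaProfilerStart/cudaProfilerStop can be reached through torch.cuda.cudart(). The start/stop arguments exist only so the gating logic can be exercised without a GPU.

```python
from contextlib import contextmanager


def _cuda_profiler_calls():
    """Return (start, stop) callables backed by the CUDA runtime.

    Assumption: PyTorch is available. It is imported lazily so the
    gating helper below can be used (and tested) without it.
    """
    import torch

    cudart = torch.cuda.cudart()
    return cudart.cudaProfilerStart, cudart.cudaProfilerStop


@contextmanager
def profile_on_rank(rank, target_rank, start=None, stop=None):
    """Turn the CUDA profiler on only for `target_rank`.

    On every other rank this is a no-op, so those processes never
    signal the profiler to start collecting.
    """
    active = rank == target_rank
    if active:
        if start is None or stop is None:
            start, stop = _cuda_profiler_calls()
        start()
    try:
        # Yield whether this rank is the one being profiled.
        yield active
    finally:
        if active:
            stop()
```

Each rank would then wrap the launch it cares about, e.g. "with profile_on_rank(rank, target_rank=0): ..." around the allreduce call. For the start/stop calls to actually gate collection, ncu most likely also needs to be launched with --profile-from-start off, otherwise profiling begins at process start on every rank.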

Is it possible to provide a repro for us?

Sadly I no longer have access to a machine that can run this kernel. I hope that trying to profile the vLLM repo that I linked in the post will result in the same error. Feel free to reach out if you have any questions.