NCU hangs when trying to profile a multi-GPU kernel

szymon.ozog · December 11, 2024, 2:20pm

I tried running
ncu --target-processes all --replay-mode application -k regex:cross_device -o prof_report -f python
to profile an allreduce kernel, but the process keeps hanging:

==PROF== Profiling "cross_device_reduce_2stage": Application replay pass 1
==WARNING== Launching the workload is taking more time than expected. If this continues to hang, terminate the profile and re-try by profiling the range of all related launches using '--replay-mode app-range'. See https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#replay for more details.

Any ideas on what’s happening here?

veraj · December 23, 2024, 6:39am

Hi, @szymon.ozog

The warning already suggests a solution and links to the relevant documentation. Have you tried it?

szymon.ozog · December 24, 2024, 1:14pm

Sadly, the kernel still hung when running with app-range. The workaround I managed to get working is profiling the kernels on one GPU at a time:

if rank == x:
    cuProfilerStart()
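
For reference, the same workaround could be sketched a bit more fully as below. This is only a minimal illustration, not the exact code from the run above: `profile_on_rank` is a hypothetical helper name, and the `start`/`stop` callables are passed in so the snippet does not depend on any particular CUDA binding. With PyTorch, `torch.cuda.profiler.start` and `torch.cuda.profiler.stop` would be the natural choices; as far as I know, ncu would then also need `--profile-from-start off` so that nothing is captured before the start call.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def profile_on_rank(
    rank: int,
    target_rank: int,
    start: Callable[[], object],
    stop: Callable[[], object],
) -> Iterator[bool]:
    """Run start()/stop() around the block only on target_rank.

    Yields True on the rank that is being profiled, False elsewhere.
    """
    active = rank == target_rank
    if active:
        start()  # open the profiler range on the chosen rank only
    try:
        yield active
    finally:
        if active:
            stop()  # close the range even if the block raises


# Hypothetical usage with PyTorch (assumed setup, not taken from the thread):
#   import torch
#   import torch.distributed as dist
#   with profile_on_rank(dist.get_rank(), 0,
#                        torch.cuda.profiler.start,
#                        torch.cuda.profiler.stop):
#       run_allreduce()  # placeholder for whatever launches the kernel
```

Every rank still executes the collective, so the allreduce itself is not changed; only the chosen rank opens a profiler range.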

veraj · January 2, 2025, 7:14am

Is it possible to provide a repro for us?

szymon.ozog · January 8, 2025, 2:53pm

Sadly, I no longer have access to a machine that can run this kernel. I hope that trying to profile the vLLM repo that I linked in the post will result in the same error. Feel free to reach out if you have any questions.
