Hi, I can get nsys profile to work for PyTorch distributed training runs, but in edge cases it can hang for a long time or even forever. It's hard to produce a good reproduction case for this, so below is a simple reproducer that issues an NCCL operation that never finishes. Doing that is obviously silly, but the way nsys behaves in this case can be a problem in real code, too: it attempts to flush all CUDA streams, waits forever, and slowly fills up the disk. I think it should time out after 10 seconds or so and write whatever data it has collected up to that point, instead of continuing to wait.
To reproduce, run this Python code:
import os
import torch
import torch.distributed
# boilerplate to set up distributed code
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")
torch.distributed.init_process_group(backend="nccl", device_id=device)
rank = torch.distributed.get_rank()
torch.distributed.barrier()
# The test is this: Do an all-reduce on rank 0, but the other ranks don't participate.
# This is not allowed and will never finish. But the behavior of cudaProfilerStop in this
# case is particularly bad. It hangs forever and keeps on writing data to disk. It seems
# to want to sync all operations that are still in progress. It should have a timeout and
# if the sync takes too long, it should just dump whatever data it has.
#
# This is obviously a bit of a silly example, but I have seen examples in non-buggy code
# where cudaProfilerStop has bad behavior. Unfortunately it's hard to make a deterministic
# reproduction case for that. You'll just have to believe me that you can get behavior that
# looks very much like this in real code. cudaProfilerStop doesn't have to literally hang
# forever for this behavior to be a problem.
hang_forever = True
if (not hang_forever) or rank == 0:
    torch.cuda.cudart().cudaProfilerStart()
    data = torch.randn([4, 4, 4], device=device)
    torch.distributed.all_reduce(data)
    print("Calling cudaProfilerStop", flush=True)
    torch.cuda.cudart().cudaProfilerStop()
    print("Finished cudaProfilerStop", flush=True)
    print(data.sum().item())
torch.distributed.destroy_process_group()
Save this as repro_deadlock.py and run it on a machine with at least two GPUs using torchrun like this:
nsys profile --output deadlock_repro_%n.nsys-rep --wait=primary --capture-range=cudaProfilerApi --capture-range-end=stop torchrun --nproc_per_node=2 --nnodes=1 -- repro_deadlock.py
The current behavior is that this waits forever. The expected behavior is that it writes out a file and we see the "Finished cudaProfilerStop" message. (After that the program will hang on the next line, but that's not your problem.) It's OK if the written file is truncated before the NCCL operation.
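To make the suggestion concrete, here is a minimal sketch of the kind of bounded wait I have in mind, written against the PyTorch CUDA API purely as an analogy for whatever nsys does internally when it flushes streams at cudaProfilerStop. The function name and the 10-second default are my own invention; nothing here is actual nsys code.
import time
import torch

def drain_stream_with_timeout(stream, timeout_s=10.0):
    # Record a marker event; it completes once all work queued on the stream so far has finished.
    marker = torch.cuda.Event()
    marker.record(stream)
    deadline = time.monotonic() + timeout_s
    # Poll instead of calling marker.synchronize(), which would block indefinitely.
    while not marker.query():
        if time.monotonic() > deadline:
            return False  # give up and let the caller write out whatever it already has
        time.sleep(0.05)
    return True
If nsys did something equivalent when flushing, the hung all-reduce above would cost at most the timeout, and the report would still be written, just truncated before the NCCL operation.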