Nsys profile can hang for a long time when profiling PyTorch distributed training runs

Hi, I can get nsys profile to work for PyTorch distributed training runs, but in edge cases it can hang for a long time, or even forever. It’s hard to build a good reproduction case for this, so below is a simple reproducer that issues an NCCL operation that never finishes. This is obviously a silly thing to do, but nsys’s behavior in this case can be problematic in real code, too: nsys attempts to flush all CUDA streams and waits forever, slowly filling up the disk. I think it should time out after 10 seconds or so and write what it has collected up to that point, instead of continuing to wait.

To reproduce, run this Python code:

import os
import torch

# boilerplate to set up distributed code
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")
torch.distributed.init_process_group(backend="nccl", device_id=device)
rank = torch.distributed.get_rank()
torch.distributed.barrier()


# The test is this: Do an all-reduce on rank 0, but the other ranks don't participate.
# This is not allowed and will never finish. But the behavior of cudaProfilerStop in this
# case is particularly bad. It hangs forever and keeps on writing data to disk. It seems
# to want to sync all operations that are still in progress. It should have a timeout and
# if the sync takes too long, it should just dump whatever data it has.
#
# This is obviously a bit of a silly example, but I have seen cudaProfilerStop misbehave
# in non-buggy code as well. Unfortunately it's hard to make a deterministic reproduction
# case for that, so you'll have to take my word that very similar behavior shows up in
# real code. cudaProfilerStop doesn't have to literally hang forever for this to be a
# problem.
hang_forever = True
if (not hang_forever) or rank == 0:
    torch.cuda.cudart().cudaProfilerStart()
    data = torch.randn([4, 4, 4], device=device)
    torch.distributed.all_reduce(data)
    print("Calling cudaProfilerStop", flush=True)
    torch.cuda.cudart().cudaProfilerStop()
    print("Finished cudaProfilerStop", flush=True)
    print(data.sum().item())

torch.distributed.destroy_process_group()

Run this on a machine with at least two GPUs using torchrun like this:

nsys profile --output deadlock_repro_%n.nsys-rep --wait=primary --capture-range=cudaProfilerApi --capture-range-end=stop torchrun --nproc_per_node=2 --nnodes=1 -- repro_deadlock.py

The current behavior is that this waits forever. The expected behavior is that it writes out a file and we see the “Finished cudaProfilerStop” message. (After that the program will hang on the next line, but that’s not nsys’s problem.) It’s OK if the written file is truncated before the NCCL operation.

@skottapalli to comment

Which version of nsys are you using?
I believe we do have a timeout of a couple of minutes in the code. I will repro with the example you have provided and get back to you.

I reproduced the issue with nsight-systems-2025.3.1.

A timeout of a couple of minutes is going to be too long for a large PyTorch training run: a trace that’s several minutes long would produce an enormous file that the UI would refuse to open. If the concern is that a shorter timeout would be bad for some users, can you make it configurable?

I apologize for the delay in getting back to you. By default, nsys calls cuCtxSynchronize inside the cudaProfilerStop call before flushing the internal buffers that contain the CUDA trace data. We are aware that this can cause problems when there are multiple contexts (which is the case when there are multiple GPUs).
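To illustrate the mechanism with your reproducer (this is just a sketch of the failure mode, not of nsys internals): a plain torch.cuda.synchronize() issued after the unmatched all_reduce blocks in the same way, because a device-wide sync has to wait for the NCCL kernel that never completes. Assuming the same two-GPU torchrun launch as above:

import os
import torch

# Same distributed setup as the reproducer.
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")
torch.distributed.init_process_group(backend="nccl", device_id=device)
rank = torch.distributed.get_rank()
torch.distributed.barrier()

if rank == 0:
    data = torch.randn([4, 4, 4], device=device)
    # Enqueue a collective that the other rank never joins; the NCCL kernel
    # stays on the stream and never completes.
    torch.distributed.all_reduce(data)
    # A device-wide sync (comparable to the cuCtxSynchronize that nsys issues
    # before flushing its trace buffers) now has to wait for that kernel,
    # so it never returns.
    torch.cuda.synchronize(device)
    print("never reached", flush=True)

torch.distributed.destroy_process_group()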

Could you add --flush-on-cudaprofilerstop=false to your command line, like so, to see if it helps?

nsys profile --output deadlock_repro_%n.nsys-rep --wait=primary --capture-range=cudaProfilerApi --capture-range-end=stop --flush-on-cudaprofilerstop=false torchrun --nproc_per_node=2 --nnodes=1 -- repro_deadlock.py