Hi, I can get nsys profile to work for PyTorch distributed training runs, but in edge cases it can hang for a long time or even forever. It's hard to produce a good reproduction case for this, so below is a simple reproducer that issues an NCCL operation that never finishes. Doing that is obviously silly, but the way nsys behaves in this case can be a problem in real code, too: it attempts to flush all CUDA streams, waits forever, and slowly fills up the disk. I think it should time out after 10 seconds or so and write whatever data it has collected up to that point, instead of continuing to wait.
To reproduce, run this Python code:
import os
import torch
import torch.distributed
# boilerplate to set up distributed code
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")
torch.distributed.init_process_group(backend="nccl", device_id=device)
rank = torch.distributed.get_rank()
torch.distributed.barrier()
# The test is this: Do an all-reduce on rank 0, but the other ranks don't participate.
# This is not allowed and will never finish. But the behavior of cudaProfilerStop in this
# case is particularly bad. It hangs forever and keeps on writing data to disk. It seems
# to want to sync all operations that are still in progress. It should have a timeout and
# if the sync takes too long, it should just dump whatever data it has.
#
# This is obviously a bit of a silly example, but I have seen examples in non-buggy code
# where cudaProfilerStop has bad behavior. Unfortunately it's hard to make a deterministic
# reproduction case for that. You'll just have to believe me that you can get behavior that
# looks very much like this in real code. cudaProfilerStop doesn't have to literally hang
# forever for this behavior to be a problem.
hang_forever = True
if (not hang_forever) or rank == 0:
    torch.cuda.cudart().cudaProfilerStart()
    data = torch.randn([4, 4, 4], device=device)
    torch.distributed.all_reduce(data)
    print("Calling cudaProfilerStop", flush=True)
    torch.cuda.cudart().cudaProfilerStop()
    print("Finished cudaProfilerStop", flush=True)
    print(data.sum().item())
torch.distributed.destroy_process_group()
Save this as repro_deadlock.py and run it on a machine with at least two GPUs using torchrun like this:
nsys profile --output deadlock_repro_%n.nsys-rep --wait=primary --capture-range=cudaProfilerApi --capture-range-end=stop torchrun --nproc_per_node=2 --nnodes=1 -- repro_deadlock.py
The current behavior is that this waits forever. The expected behavior is that it writes out a file and we see the "Finished cudaProfilerStop" message. (After that the program will hang on the next line, but that's not your problem.) It's OK if the written file is truncated before the NCCL operation.
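To make the suggestion concrete, here is a minimal sketch of the kind of bounded wait I have in mind, written against the PyTorch CUDA API purely as an analogy for whatever nsys does internally when it flushes streams at cudaProfilerStop. The function name and the 10-second default are my own invention; nothing here is actual nsys code.
import time
import torch

def drain_stream_with_timeout(stream, timeout_s=10.0):
    # Record a marker event; it completes once all work queued on the stream so far has finished.
    marker = torch.cuda.Event()
    marker.record(stream)
    deadline = time.monotonic() + timeout_s
    # Poll instead of calling marker.synchronize(), which would block indefinitely.
    while not marker.query():
        if time.monotonic() > deadline:
            return False  # give up and let the caller write out whatever it already has
        time.sleep(0.05)
    return True
If nsys did something equivalent when flushing, the hung all-reduce above would cost at most the timeout, and the report would still be written, just truncated before the NCCL operation.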