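I’m experiencing a crash when profiling my PyTorch application with Nsight Systems (Version 2025.06.01). The application runs normally without profiling, but with the profiler enabled the worker process dies with a segmentation fault (SIGSEGV, exit code -11) partway through the first epoch, around iteration 546 of 40037 in the log below.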
Environment:
- Nsight Systems Version: 2025.06.01
- Framework: PyTorch (distributed training)
- Model: ResNet50 on ImageNet
Application Command (works fine):
torchrun --nnodes=1 --nproc_per_node=1 --node_rank=0 \
--master_addr=localhost --master_port=23456 \
imagenet.py /mnt/data/dataset/imagenet/ILSVRC/Data/CLS-LOC \
--epochs 1 -a resnet50 --batch-size 32 --workers 8 --print-freq 5
Profiler Command (causes crash):
nsys profile \
--trace cuda,nvtx,osrt \
--pytorch=autograd-nvtx,functions-trace \
--stop-on-exit=true \
--nic-metrics=true \
torchrun --nnodes=1 --nproc_per_node=1 --node_rank=0 \
--master_addr=localhost --master_port=23456 \
imagenet.py /mnt/data/dataset/imagenet/ILSVRC/Data/CLS-LOC \
--epochs 1 -a resnet50 --batch-size 32 --workers 8 --print-freq 5
Log Output Before Crash:
Epoch: [0][ 541/40037] Time 0.049 ( 0.083) Data 0.000 ( 0.031) Loss 6.8578e+00 (7.0407e+00) Acc@1 0.00 ( 0.16) Acc@5 0.00 ( 0.70)
Epoch: [0][ 546/40037] Time 0.050 ( 0.083) Data 0.000 ( 0.031) Loss 6.9120e+00 (7.0398e+00) Acc@1 0.00 ( 0.17) Acc@5 0.00 ( 0.70)
E0112 09:35:00.784000 140351560521600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 0 (pid: 102119) of binary: /usr/bin/python
I0112 09:35:00.785000 140351560521600 torch/distributed/elastic/multiprocessing/errors/__init__.py:361] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 0)
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
imagenet.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2026-01-12_09:35:00
host : test2
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 102119)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 102119
========================================================
[I112 09:35:00.241150518 TCPStoreLibUvBackend.cpp:1095] [c10d] uv_loop_close failed with:-16 errn:EBUSY desc:resource busy or locked
[I112 09:35:00.241214082 TCPStoreLibUvBackend.cpp:1105] [c10d] uv_loop cleanup finished.
The target application terminated. One or more process it created re-parented.
Waiting for termination of re-parented processes.
Use the `--wait` option to modify this behavior.
Generating '/tmp/nsys-report-af74.qdstrm'
[1/1] [========================100%] nsys-report-125b.nsys-rep
Generated:
/tmp/nsys-report-125b.nsys-rep
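
The launcher reports the failure "with no error file", and the `@record` decorator it suggests records Python exceptions, so it is unlikely to capture a segfault. One way to get more detail might be to enable Python's standard `faulthandler` module at the top of imagenet.py, which dumps the Python stack of every thread when a fatal signal such as SIGSEGV arrives. A minimal sketch follows; `enable_crash_traceback` is a hypothetical helper name and is not part of the original script, and the commented-out `main()` call stands in for whatever entry point imagenet.py already has.

```python
import faulthandler
import sys


def enable_crash_traceback(stream=sys.stderr):
    """Ask Python to dump all thread stacks to `stream` on a fatal signal.

    Hypothetical helper for illustration; returns True once faulthandler
    is active (either newly enabled here or already enabled elsewhere).
    """
    if not faulthandler.is_enabled():
        # all_threads=True also covers DataLoader / background threads
        # living in this process, not only the main thread.
        faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_traceback()
    # main()  # assumption: the existing entry point of imagenet.py
```

The same effect should be available without editing the script by setting the environment variable PYTHONFAULTHANDLER=1 before the nsys command. The resulting traceback only shows where the Python side was when the signal arrived; it does not by itself show whether the fault originates in the profiler's injected libraries.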
