Nsight Systems profiler causes application crash after running for a while

I’m experiencing a crash when profiling my PyTorch application with Nsight Systems (Version 2025.06.01). The application runs normally without profiling, but crashes after running for a while when the profiler is enabled.

Environment:

  • Nsight Systems Version: 2025.06.01
  • Framework: PyTorch (distributed training)
  • Model: ResNet50 on ImageNet

Application Command (works fine):

torchrun --nnodes=1 --nproc_per_node=1 --node_rank=0 \
  --master_addr=localhost --master_port=23456 \
  imagenet.py /mnt/data/dataset/imagenet/ILSVRC/Data/CLS-LOC \
  --epochs 1 -a resnet50 --batch-size 32 --workers 8 --print-freq 5

Profiler Command (causes crash):

nsys profile \
  --trace cuda,nvtx,osrt \
  --pytorch=autograd-nvtx,functions-trace \
  --stop-on-exit=true \
  --nic-metrics=true \
  torchrun --nnodes=1 --nproc_per_node=1 --node_rank=0 \
  --master_addr=localhost --master_port=23456 \
  imagenet.py /mnt/data/dataset/imagenet/ILSVRC/Data/CLS-LOC \
  --epochs 1 -a resnet50 --batch-size 32 --workers 8 --print-freq 5

Log Output Before Crash:

Epoch: [0][  541/40037]	Time  0.049 ( 0.083)	Data  0.000 ( 0.031)	Loss 6.8578e+00 (7.0407e+00)	Acc@1   0.00 (  0.16)	Acc@5   0.00 (  0.70)
Epoch: [0][  546/40037]	Time  0.050 ( 0.083)	Data  0.000 ( 0.031)	Loss 6.9120e+00 (7.0398e+00)	Acc@1   0.00 (  0.17)	Acc@5   0.00 (  0.70)
E0112 09:35:00.784000 140351560521600 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -11) local_rank: 0 (pid: 102119) of binary: /usr/bin/python
I0112 09:35:00.785000 140351560521600 torch/distributed/elastic/multiprocessing/errors/__init__.py:361] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 0)
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
imagenet.py FAILED
--------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-01-12_09:35:00
  host      : test2
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 102119)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 102119
========================================================
[I112 09:35:00.241150518 TCPStoreLibUvBackend.cpp:1095] [c10d] uv_loop_close failed with:-16 errn:EBUSY desc:resource busy or locked
[I112 09:35:00.241214082 TCPStoreLibUvBackend.cpp:1105] [c10d] uv_loop cleanup finished.
The target application terminated. One or more process it created re-parented.
Waiting for termination of re-parented processes.
Use the `--wait` option to modify this behavior.
Generating '/tmp/nsys-report-af74.qdstrm'
[1/1] [========================100%] nsys-report-125b.nsys-rep
Generated:
	/tmp/nsys-report-125b.nsys-rep

The crash seems to come from the nsys library.

Could someone help me take a look at this issue?

Thank you for reporting this.
I will try to reproduce the issue and take a look.
If you can identify the problematic nsys flag, that would be helpful (does the issue occur without --nic-metrics or --pytorch?).

Without --nic-metrics=true the program runs normally; enabling --nic-metrics=true causes it to crash. --pytorch appears to have no effect.

Thank you very much for this information. I will take a look and update.

@ywsample Could you please share the output of nsys status --network?

Does nsys profile --duration=10 --nic-metrics=true work correctly? Do not provide an app; without one, nsys does a system-wide collection.

nsys status --network output:

Network Profiling Environment Check
- OFED version: MLNX_OFED_LINUX-5.8-2.0.3.0
- Network Interface Card (NIC): Available
- Network features' library dependencies: OK

Additionally, basic network profiling works correctly with the following command:
nsys profile --duration=10 --nic-metrics=true

Thanks for sharing this, @ywsample.
The NIC profiling feature seems to work correctly on its own.

From your second message in this thread, the error is coming from the injection library.
I have a few suggestions to move forward.

  • Is it possible to share a minimal code sample that reliably reproduces the issue?
  • Are you comfortable with collecting and sharing logs from a crash? Details are below; you can send me the log file in a private message if you don’t want to post it publicly.
  • Did you try different configurations of the --trace options? With all the other nsys CLI options removed, does the crash reliably reproduce when you add a specific tracing option? E.g., --trace=cuda or --trace=nvtx or --trace=osrt. Or maybe a combination of two, such as --trace=cuda,nvtx.
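
One way to work through the --trace combinations suggested above is a small loop like the one below. This is only a sketch under a few assumptions: it keeps --nic-metrics=true on in every run (the crash was only reported with that flag), and DRY_RUN, the bisect_* report names, and the per-run .log files are hypothetical names chosen for this example. It reuses the torchrun command from the original post and, in a real run, checks each log for the SIGSEGV line that torchrun printed.

```shell
#!/bin/sh
set -eu

# Print (or run) one nsys command per --trace configuration.
# DRY_RUN is a hypothetical switch for this sketch: commands are only
# printed unless DRY_RUN=0 is set in the environment.
bisect_traces() {
  # The training command from the original post, unchanged.
  set -- torchrun --nnodes=1 --nproc_per_node=1 --node_rank=0 \
    --master_addr=localhost --master_port=23456 \
    imagenet.py /mnt/data/dataset/imagenet/ILSVRC/Data/CLS-LOC \
    --epochs 1 -a resnet50 --batch-size 32 --workers 8 --print-freq 5

  for trace in cuda nvtx osrt cuda,nvtx cuda,osrt nvtx,osrt; do
    # Hypothetical report name, e.g. bisect_cuda_nvtx.
    out="bisect_$(printf '%s' "$trace" | tr ',' '_')"
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "nsys profile --trace=$trace --nic-metrics=true -o $out $*"
    else
      # Save the console output and keep going even if the run fails.
      nsys profile --trace="$trace" --nic-metrics=true -o "$out" "$@" \
        > "$out.log" 2>&1 || true
      if grep -q 'SIGSEGV' "$out.log"; then
        echo "trace=$trace: SIGSEGV seen"
      else
        echo "trace=$trace: no SIGSEGV in log"
      fi
    fi
  done
}

bisect_traces
```

By default this only prints the six commands so they can be reviewed first; running it with DRY_RUN=0 executes them one after another, which can take a while since each run trains until it crashes or the epoch finishes.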

To collect logs:

  • Create a file named nvlog.config in the same directory as the nsys executable.
  • The file should contain:
+ 100iw 100ef 0IW 0EF   global
- quadd_verbose_

$ /tmp/nsys.log

ForceFlush

Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]: $text
  • Run a profiling session that reproduces the crash. When the session ends, there should be an nsys.log file under /tmp. Please share that file.
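
The log-collection steps above could be scripted roughly as follows. This is a sketch, not an official tool: the config contents are copied verbatim from this thread, NSYS_DIR is a hypothetical override variable, and the script assumes that resolving nsys on PATH leads to the directory of the real executable. If nsys on PATH is a wrapper script, that assumption may not hold, and setting NSYS_DIR by hand is the safer route.

```shell
#!/bin/sh
set -eu

# Write nvlog.config (contents as given in this thread) into the
# directory passed as the first argument.
write_nvlog_config() {
  cat > "$1/nvlog.config" <<'EOF'
+ 100iw 100ef 0IW 0EF   global
- quadd_verbose_

$ /tmp/nsys.log

ForceFlush

Format $sevc$time|${name:0}|${tid:5}|${file:0}:${line:0}[${sfunc:0}]: $text
EOF
}

# NSYS_DIR is a hypothetical override; otherwise try to locate nsys on PATH.
if [ -n "${NSYS_DIR:-}" ]; then
  write_nvlog_config "$NSYS_DIR"
elif command -v nsys >/dev/null 2>&1; then
  # Assumption: following symlinks from PATH reaches the real executable.
  write_nvlog_config "$(dirname "$(readlink -f "$(command -v nsys)")")"
else
  echo "nsys not found on PATH; set NSYS_DIR and re-run" >&2
fi
```

The heredoc delimiter is quoted so that the $ and ${...} sequences in the Format line are written literally instead of being expanded by the shell. Once the crash has been reproduced and /tmp/nsys.log shared, removing nvlog.config again should stop the extra logging.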