Wrong event order has been detected when adding events to the collection

I am running the following command inside a Slurm batch file to profile a distributed DNN training script.

srun nsys profile --sample=none --cpuctxsw=none --nic-metrics=true --gpu-metrics-device=0  --trace=cuda,nvtx,cudnn --output=logs/nsys_logs%h torchrun $script

Then I open the resulting report in Nsight Systems.

I get this error:

Error {
  Type: RuntimeError
  SubError {
    Type: InvalidArgument
    Props {
      Items {
        Type: OriginalExceptionClass
        Value: "N5boost10wrapexceptIN11QuadDCommon24InvalidArgumentExceptionEEE"
      }
      Items {
        Type: OriginalFile
        Value: "/dvs/p4/build/sw/devtools/Agora/Rel/CUDA12.4/QuadD/Host/Analysis/Modules/EventCollection.cpp"
      }
      Items {
        Type: OriginalLine
        Value: "1048"
      }
      Items {
        Type: OriginalFunction
        Value: "void QuadDAnalysis::EventCollection::CheckOrder(QuadDAnalysis::EventCollectionHelper::EventContainer&, const QuadDAnalysis::ConstEvent&) const"
      }
      Items {
        Type: ErrorText
        Value: "Wrong event order has been detected when adding events to the collection:\nnew event ={ StartNs=103672259569 StopNs=103672285810 GlobalId=281867949464447 Event={ TraceProcessEvent=[{ Correlation=3224309 EventClass=0 TextId=864 ReturnValue=0 },] } Type=48 }\nlast event ={ StartNs=108569797108 StopNs=108588751644 GlobalId=281867949464447 Event={ TraceProcessEvent=[{ Correlation=3399850 EventClass=0 TextId=1051 ReturnValue=0 },] } Type=48 }"
      }
    }
  }
}
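
Reading the ErrorText: both events share the same GlobalId, and the new event has an earlier StartNs than the last one already added, so the check appears to expect events in non-decreasing start-time order. Below is a minimal Python sketch that pulls the timestamps out of the message and computes the gap. It is only an illustration, not Nsight Systems' actual logic; the helper names and the regular expressions are my own, and the message is abbreviated to the fields used here.

```python
import re

# Abbreviated copy of the ErrorText above (only the fields used below).
ERROR_TEXT = (
    "new event ={ StartNs=103672259569 StopNs=103672285810 GlobalId=281867949464447 }\n"
    "last event ={ StartNs=108569797108 StopNs=108588751644 GlobalId=281867949464447 }"
)


def parse_events(text):
    """Return {"new": {...}, "last": {...}} with integer fields for each event."""
    events = {}
    # "." does not match newlines, so each event line is captured separately.
    for label, body in re.findall(r"(new|last) event =\{(.*)", text):
        fields = re.findall(r"(StartNs|StopNs|GlobalId)=(\d+)", body)
        events[label] = {name: int(value) for name, value in fields}
    return events


def out_of_order_gap_ns(events):
    """Positive when the new event starts before the last event already stored."""
    return events["last"]["StartNs"] - events["new"]["StartNs"]


events = parse_events(ERROR_TEXT)
gap_ns = out_of_order_gap_ns(events)
same_id = events["new"]["GlobalId"] == events["last"]["GlobalId"]
print(f"same GlobalId: {same_id}")
print(f"new event starts {gap_ns / 1e9:.3f} s before the last event")
```

For the message above, the new event starts roughly 4.9 s before the last one.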

System Details

I view the report on this machine:

Nsight Systems:
Chip: Apple M2
OS: 14.2 (23C64)
Version: 2024.1.1.59-241133802077v0 OSX.
Qt version: 6.3.2.
Google Protocol Buffers version: 3.21.1.
Boost version: 1.78.0.

I run the experiments (and collect the profile) on this machine:

$ nsys -v
NVIDIA Nsight Systems version 2023.4.4.54-234433681190v0
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:81:00.0 Off |                    0 |
| N/A   44C    P0             27W /   70W |       0MiB /  15360MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

$ nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 1
Linux Distribution = CentOS
Linux Kernel Version = 3.10.0-1127.19.1.el7.x86_64: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail

I do not have root access, since this is a supercomputer environment that I don’t control.

Please try profiling on the target with Nsight Systems 2024.2 (the target is currently running 2023.4.4).

This is probably a bug that has been fixed.
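
To confirm which version is installed on the target before re-profiling, the `nsys -v` output can be compared against the suggested release. This is a rough Python sketch, assuming the version string format shown in the output above; the helper names are hypothetical and the (2024, 2) threshold is simply the version suggested here.

```python
import re


def parse_nsys_version(output):
    """Extract (year, major, minor) from a string such as the `nsys -v` output."""
    match = re.search(r"(\d{4})\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version found in: {output!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(version, minimum):
    """Compare only as many leading components as the minimum specifies."""
    return version[: len(minimum)] >= minimum


# Output of `nsys -v` on the target, copied from the post above.
target = parse_nsys_version("NVIDIA Nsight Systems version 2023.4.4.54-234433681190v0")
needs_upgrade = not is_at_least(target, (2024, 2))
print(target, "needs upgrade:", needs_upgrade)
```

The same parser also accepts the version string reported by the host GUI, so both sides can be checked the same way.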