I am running the following command inside a Slurm batch file to profile a distributed DNN training script:
srun nsys profile --sample=none --cpuctxsw=none --nic-metrics=true --gpu-metrics-device=0 --trace=cuda,nvtx,cudnn --output=logs/nsys_logs%h torchrun $script
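For context, here is a minimal sketch of how the batch file wraps this command. The #SBATCH values and the train.py default are placeholders (assumptions), not my actual job settings; the nsys flags are exactly the ones above. The sketch also assumes $script is a single path with no extra arguments. When srun is not on the PATH, it only prints the command, which makes it easy to check off-cluster.

```shell
#!/bin/bash
#SBATCH --job-name=nsys-profile   # placeholder name (assumption)
#SBATCH --nodes=2                 # placeholder node count (assumption)
#SBATCH --ntasks-per-node=1       # one nsys/torchrun launcher per node (assumption)

# Training script handed to torchrun; train.py is a hypothetical default.
script=${script:-train.py}

# Directory that the --output path below points into.
mkdir -p logs

# Run the profiling command behind a launcher prefix:
# "srun" on the cluster, "echo" for a dry run.
profile() {
  "$@" nsys profile \
    --sample=none --cpuctxsw=none \
    --nic-metrics=true --gpu-metrics-device=0 \
    --trace=cuda,nvtx,cudnn \
    --output=logs/nsys_logs%h \
    torchrun "$script"
}

if command -v srun >/dev/null 2>&1; then
  profile srun
else
  # Off-cluster: print the command instead of running it.
  profile echo
fi
```

Because of %h in the output name, each host writes its own report file under logs/.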
Then I open the resulting report in Nsight Systems and get this error:
Error {
  Type: RuntimeError
  SubError {
    Type: InvalidArgument
    Props {
      Items {
        Type: OriginalExceptionClass
        Value: "N5boost10wrapexceptIN11QuadDCommon24InvalidArgumentExceptionEEE"
      }
      Items {
        Type: OriginalFile
        Value: "/dvs/p4/build/sw/devtools/Agora/Rel/CUDA12.4/QuadD/Host/Analysis/Modules/EventCollection.cpp"
      }
      Items {
        Type: OriginalLine
        Value: "1048"
      }
      Items {
        Type: OriginalFunction
        Value: "void QuadDAnalysis::EventCollection::CheckOrder(QuadDAnalysis::EventCollectionHelper::EventContainer&, const QuadDAnalysis::ConstEvent&) const"
      }
      Items {
        Type: ErrorText
        Value: "Wrong event order has been detected when adding events to the collection:\nnew event ={ StartNs=103672259569 StopNs=103672285810 GlobalId=281867949464447 Event={ TraceProcessEvent=[{ Correlation=3224309 EventClass=0 TextId=864 ReturnValue=0 },] } Type=48 }\nlast event ={ StartNs=108569797108 StopNs=108588751644 GlobalId=281867949464447 Event={ TraceProcessEvent=[{ Correlation=3399850 EventClass=0 TextId=1051 ReturnValue=0 },] } Type=48 }"
      }
    }
  }
}
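To make the ErrorText easier to read: the new event starts earlier than the last event already added for the same GlobalId. A quick shell check of the two StartNs values copied from the message (the variable names are mine) gives the size of the gap.

```shell
# StartNs values copied from the ErrorText (nanoseconds).
new_start_ns=103672259569
last_start_ns=108569797108

# A positive gap means the new event begins before the last recorded one.
gap_ns=$(( last_start_ns - new_start_ns ))
gap_ms=$(( gap_ns / 1000000 ))

echo "new event starts ${gap_ns} ns (~${gap_ms} ms) before the last event"
```

So the two TraceProcessEvent entries are roughly 4.9 seconds out of order.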
System Details
I view the report on this machine:
Nsight Systems:
Chip: Apple M2
OS: 14.2 (23C64)
Version: 2024.1.1.59-241133802077v0 OSX.
Qt version: 6.3.2.
Google Protocol Buffers version: 3.21.1.
Boost version: 1.78.0.
I run the experiments on this machine:
$ nsys -v
NVIDIA Nsight Systems version 2023.4.4.54-234433681190v0
$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:81:00.0 Off |                    0 |
| N/A   44C    P0             27W /   70W |       0MiB /  15360MiB |      7%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ nsys status -e
Timestamp counter supported: Yes
CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 1
Linux Distribution = CentOS
Linux Kernel Version = 3.10.0-1127.19.1.el7.x86_64: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail
I do not have root access because this is a supercomputer environment that I don’t control.