Error when sampling PyTorch profile with nsys and dlprof

Hi, I have a PyTorch training workflow which, when profiled through nsys (or through dlprof, by adding the extra line import nvidia_dlprof_pytorch_nvtx as nvtx and running the training loop inside the context torch.autograd.profiler.emit_nvtx()), gives me the following error at the end of profiling:
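For context, a minimal sketch of the instrumentation described above. The model, optimizer, and train_step function are placeholders for illustration; nvidia_dlprof_pytorch_nvtx is only present in NVIDIA NGC PyTorch containers, so the import is guarded:

```python
import torch

# nvidia_dlprof_pytorch_nvtx ships with NVIDIA NGC PyTorch containers;
# guard the import so the script still runs where dlprof is absent
try:
    import nvidia_dlprof_pytorch_nvtx as nvtx
    nvtx.init(enable_function_stack=True)  # insert NVTX ranges into PyTorch ops
    HAVE_DLPROF = True
except ImportError:
    HAVE_DLPROF = False

def train_step(model, optim, x, y):
    # one placeholder optimization step
    optim.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optim.step()
    return loss.item()

if __name__ == "__main__" and torch.cuda.is_available():
    model = torch.nn.Linear(4, 1).cuda()
    optim = torch.optim.SGD(model.parameters(), lr=0.01)
    # emit_nvtx() wraps autograd ops in NVTX ranges so nsys/dlprof can see them
    with torch.autograd.profiler.emit_nvtx():
        for _ in range(3):
            train_step(model, optim,
                       torch.randn(8, 4, device="cuda"),
                       torch.randn(8, 1, device="cuda"))
```

The script would then be launched under the profiler, e.g. with nsys profile or dlprof.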

Creating final output files...
Processing [===============================================================100%]

**** Analysis failed with:
Status: TargetProfilingFailed
Props {
  Items {
    Type: DeviceId
    Value: "Local (CLI)"
Error {
  Type: RuntimeError
  SubError {
    Type: ProcessEventsError
    Props {
      Items {
        Type: ErrorText
        Value: "/build/agent/work/20a3cfcd1c25021d/QuadD/Host/Analysis/EventHandler/PerfEventHandler.cpp(501): Throw in function void QuadDAnalysis::EventHandler::PerfEventHandler::PutCpuEvent(QuadDCommon::CpuId, QuadDAnalysis::EventHandler::PerfEventHandler::EventPtr)\nDynamic exception type: boost::exception_detail::clone_impl<QuadDAnalysis::ChronologicalOrderError>\nstd::exception::what: ChronologicalOrderError\n[QuadDCommon::tag_message*] = Cpu event chronological order was broken.\n"

These are the installed versions:

  1. CUDA: 11.3
  2. nsys: 2021.3.2.12-9700a21
  3. dlprof: v1.8.0 built on 2021-12-01 08:22:18 (Build 29839685)

The output sqlite file is also recognised as an invalid DLProf database when profiling through dlprof. I'm getting the same errors on two remote systems, one with a V100 and the other with an A100.



Did you ever resolve the problem?

I’m currently struggling with the same error.

I think I narrowed it down to the DataLoader not running in the main thread, i.e. with >0 workers.

If the DataLoader iteration happens inside torch.autograd.profiler.emit_nvtx(), it only seems to work when the data loading happens on the main thread (with the DataLoader's num_workers set to 0).
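A minimal sketch of that workaround, assuming a toy TensorDataset; the point is only the num_workers=0 argument, which keeps data loading on the main thread instead of spawning worker processes:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader():
    # num_workers=0 keeps loading on the main thread; >0 spawns worker
    # processes, which is what seems to break nsys/dlprof profiling here
    ds = TensorDataset(torch.randn(8, 4), torch.randn(8, 1))
    return DataLoader(ds, batch_size=4, num_workers=0)

def train_one_epoch(model, optim, loader):
    for x, y in loader:
        optim.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optim.step()

if __name__ == "__main__" and torch.cuda.is_available():
    model = torch.nn.Linear(4, 1)
    optim = torch.optim.SGD(model.parameters(), lr=0.01)
    # iterating the loader inside emit_nvtx() worked for me only
    # with num_workers=0
    with torch.autograd.profiler.emit_nvtx():
        train_one_epoch(model, optim, make_loader())
```

Note that num_workers=0 slows the input pipeline, so it is only a workaround for profiling runs, not a fix for production training.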