Error when sampling PyTorch profile with nsys and dlprof

Hi, I have a PyTorch training workflow which, when profiled through nsys (or through dlprof, by adding the extra line import nvidia_dlprof_pytorch_nvtx as nvtx and running the training loop inside the context torch.autograd.profiler.emit_nvtx()), gives me the following error at the end of profiling:
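For context, a minimal sketch of the instrumentation described above. The model, optimizer, and train_step function are placeholders for illustration; nvidia_dlprof_pytorch_nvtx is only present in NVIDIA NGC PyTorch containers, so the import is guarded:

```python
import torch

# nvidia_dlprof_pytorch_nvtx ships with NVIDIA NGC PyTorch containers;
# guard the import so the script still runs where dlprof is absent
try:
    import nvidia_dlprof_pytorch_nvtx as nvtx
    nvtx.init(enable_function_stack=True)  # insert NVTX ranges into PyTorch ops
    HAVE_DLPROF = True
except ImportError:
    HAVE_DLPROF = False

def train_step(model, optim, x, y):
    # one placeholder optimization step
    optim.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optim.step()
    return loss.item()

if __name__ == "__main__" and torch.cuda.is_available():
    model = torch.nn.Linear(4, 1).cuda()
    optim = torch.optim.SGD(model.parameters(), lr=0.01)
    # emit_nvtx() wraps autograd ops in NVTX ranges so nsys/dlprof can see them
    with torch.autograd.profiler.emit_nvtx():
        for _ in range(3):
            train_step(model, optim,
                       torch.randn(8, 4, device="cuda"),
                       torch.randn(8, 1, device="cuda"))
```

The script would then be launched under the profiler, e.g. with nsys profile or dlprof.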

Creating final output files...
Processing [===============================================================100%]

**** Analysis failed with:
Status: TargetProfilingFailed
Props {
  Items {
    Type: DeviceId
    Value: "Local (CLI)"
Error {
  Type: RuntimeError
  SubError {
    Type: ProcessEventsError
    Props {
      Items {
        Type: ErrorText
        Value: "/build/agent/work/20a3cfcd1c25021d/QuadD/Host/Analysis/EventHandler/PerfEventHandler.cpp(501): Throw in function void QuadDAnalysis::EventHandler::PerfEventHandler::PutCpuEvent(QuadDCommon::CpuId, QuadDAnalysis::EventHandler::PerfEventHandler::EventPtr)\nDynamic exception type: boost::exception_detail::clone_impl<QuadDAnalysis::ChronologicalOrderError>\nstd::exception::what: ChronologicalOrderError\n[QuadDCommon::tag_message*] = Cpu event chronological order was broken.\n"

These are the installed versions:

  1. CUDA: 11.3
  2. nsys: 2021.3.2.12-9700a21
  3. dlprof: v1.8.0 built on 2021-12-01 08:22:18 (Build 29839685)

The output sqlite file is also recognised as an invalid DLProf database when profiling through dlprof. I'm getting the same errors on two remote systems, one with a V100 and the other with an A100.



Did you ever resolve the problem?

I’m currently struggling with the same error.

I think I narrowed it down to the DataLoader not running in the main thread, i.e. with >0 workers.

If the DataLoader iteration happens inside torch.autograd.profiler.emit_nvtx(), it only seems to work when the data loading happens on the main thread (with the DataLoader's num_workers set to 0).
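A minimal sketch of that workaround, assuming a toy TensorDataset; the point is only the num_workers=0 argument, which keeps data loading on the main thread instead of spawning worker processes:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader():
    # num_workers=0 keeps loading on the main thread; >0 spawns worker
    # processes, which is what seems to break nsys/dlprof profiling here
    ds = TensorDataset(torch.randn(8, 4), torch.randn(8, 1))
    return DataLoader(ds, batch_size=4, num_workers=0)

def train_one_epoch(model, optim, loader):
    for x, y in loader:
        optim.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        loss.backward()
        optim.step()

if __name__ == "__main__" and torch.cuda.is_available():
    model = torch.nn.Linear(4, 1)
    optim = torch.optim.SGD(model.parameters(), lr=0.01)
    # iterating the loader inside emit_nvtx() worked for me only
    # with num_workers=0
    with torch.autograd.profiler.emit_nvtx():
        train_one_epoch(model, optim, make_loader())
```

Note that num_workers=0 slows the input pipeline, so it is only a workaround for profiling runs, not a fix for production training.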