What does this error in the Nsight UI Diagnostics Summary mean?

Can someone tell me what this error means?

Error	Analysis		00:00.402	
Event requestor failed: Source ID=
Type=ErrorInformation (18)
 Properties:
  ErrorText (100)=GPU Metrics [0]: NVPA_STATUS_ERROR
- API function: NVPW_Device_PeriodicSampler_DecodeCounters(&decodeParams)
- Error code: 1
- Source function: size_t QuadDDaemon::EventSource::GpuMetricsBackend::Impl::Collect(QuadDDaemon::EventSource::GpuMetricsBackend::CounterDataImage*)
- Source location: /build/agent/work/323cb361ab84164c/QuadD/Target/quadd_d/quadd_d/jni/EventSource/GpuMetricsBackend.cpp:1329

We were trying to trace the object detection application from the mlcommons/training_results_v2.1 GitHub repository on 2 A100 GPUs:

+ NSYSCMD=' /usr/local/cuda/bin/nsys profile -t cuda,nvtx,osrt,cublas --trace-fork-before-exec true -f true --gpu-metrics-device=all --cuda-memory-usage=true --export=sqlite -o /results/object_detection_pytorch_1x2x12_230126154320094202263.nsys-rep'

We’re just getting started with this and would greatly appreciate any recommendations. We’ve been able to profile smaller application runs successfully. I’m sure it’s some configuration issue… I’ve attached the logs for our run.

230126104234615857162_1.log (452.8 KB)

Also note that the tests run successfully without nsys (on 1 or 2 GPUs), though not 100% of the time.

@Andrey_Trachenko, who is the right person to work on this?

Having the same issue with a simpler test, using undertherain/benchmarker on GitHub (a modular framework for [not only] deep learning performance benchmarking).
Running the following command:

root@4f49d25a37a0:/rockshare/user/tniro/benchmarker#  nsys profile --sample=none  -t cuda,nvtx,cublas -f true -o /rockshare/user/tniro/gpu_benchmarks/proxy/resnet/syseng5/gpu1/gpu1 --gpu-metrics-device=all --cuda-memory-usage=true --export=sqlite python3 -m benchmarker  --mode=inference --framework=pytorch --problem=resnet50 --problem_size=16384 --batch_size=1024 --gpus=1 --nb_epoch=20
GPU 0: General Metrics for NVIDIA GA100 (any frequency)
GPU 1: General Metrics for NVIDIA GA100 (any frequency)
{
    "backend": "native",
    "batch_size": 1024,
    "batch_size_per_device": 1024,
    "channels_first": true,
    "cudnn_benchmark": true,
    "device": "NVIDIA A100 80GB PCIe",
    "framework": "pytorch",
    "framework_full": "PyTorch-1.14.0a0+410ce96",
    "gpus": [
        1
    ],
    "mode": "inference",
    "nb_epoch": 20,
    "nb_gpus": 1,
    "path_ext": "inference",
    "path_out": "./logs",
    "platform": {
        "cpu": {
            "brand": "Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz",
            "cache": {
                "1": 3145728,
                "2": 83886080,
                "3": 50331648
            },
            "clock": 966.5690390625,
            "clock_max": 3200.0,
            "clock_min": 800.0,
            "logical_cores": 128,
            "physical_cores": 64
        },
        "gpus": [
            {
                "brand": "NVIDIA A100 80GB PCIe",
                "clock": 1410000,
                "compute_capability": 8.0,
                "cores": null,
                "memory": 85024112640,
                "memory_clock": 1512000,
                "multiprocessors": 108,
                "warp_size": 32
            }
        ],
        "hdds": {
            "/dev/sda": {
                "model": "PERC H745 Frnt  ",
                "size": 936640512
            },
            "/dev/sdb": {
                "model": "DELLBOSS VD     ",
                "size": 937571968
            }
        },
        "host": "4f49d25a37a0",
        "os": "Linux-5.15.0-58-generic-x86_64-with-glibc2.29",
        "ram": {
            "total": 269870661632
        },
        "swap": 8589930496
    },
    "power": {
        "avg_watt_total": 0,
        "joules_total": 0,
        "sampling_ms": 100
    },
    "preheat": false,
    "problem": {
        "cnt_batches_per_epoch": 16,
        "cnt_samples": 16384,
        "name": "resnet50",
        "precision": "FP32",
        "size": [
            16384,
            3,
            224,
            224
        ]
    },
    "profile_pytorch": false,
    "samples_per_second": 3262.822482960377,
    "start_time": "23.02.02_14.08.59",
    "tensor_layout": "native",
    "time_batch": 0.31383871030302546,
    "time_epoch": 5.021419364848407,
    "time_sample": 0.0003064831155302983,
    "time_total": 100.42838729696814
}
Generating '/tmp/nsys-report-0904.qdstrm'
[1/2] [========================100%] gpu1.nsys-rep
Importer error status: Importation succeeded with non-fatal errors.
**** Analysis failed with:
Status: TargetProfilingFailed
Props {
  Items {
    Type: DeviceId
    Value: "Local (CLI)"
  }
}
Error {
  Type: RuntimeError
  Props {
    Items {
      Type: ErrorText
      Value: "GPU Metrics [0]: NVPA_STATUS_ERROR\n- API function: NVPW_Device_PeriodicSampler_DecodeCounters(&decodeParams)\n- Error code: 1\n- Source function: size_t QuadDDaemon::EventSource::GpuMetricsBackend::Impl::Collect(QuadDDaemon::EventSource::GpuMetricsBackend::CounterDataImage*)\n- Source location: /build/agent/work/323cb361ab84164c/QuadD/Target/quadd_d/quadd_d/jni/EventSource/GpuMetricsBackend.cpp:1357"
    }
  }
}
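Both reports carry an ErrorText with the same "header plus `- Key: value`" layout. As a minimal sketch in plain Python (this is not an nsys tool or API; the helper name `parse_error_text` is hypothetical), the two messages quoted in this thread can be split into fields and compared side by side:

```python
import re
from typing import Dict

# Header line, e.g. "GPU Metrics [0]: NVPA_STATUS_ERROR" (component, device index, status).
HEADER_RE = re.compile(r"(?P<component>[A-Za-z ]+) \[(?P<device>\d+)\]: (?P<status>\S+)$")
# Detail lines of the form "- Key: value"; the key stops at the first colon.
FIELD_RE = re.compile(r"^- (?P<key>[^:]+): (?P<value>.*)$")


def parse_error_text(text: str) -> Dict[str, str]:
    """Hypothetical helper: split one ErrorText message into a flat dict of fields."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    header = HEADER_RE.search(lines[0])
    if header is None:
        raise ValueError(f"unrecognised header: {lines[0]!r}")
    fields = dict(header.groupdict())
    for line in lines[1:]:
        match = FIELD_RE.match(line)
        if match:
            fields[match.group("key")] = match.group("value")
    return fields


# ErrorText copied from the first post (object detection run).
ERR_DETECTION = (
    "GPU Metrics [0]: NVPA_STATUS_ERROR\n"
    "- API function: NVPW_Device_PeriodicSampler_DecodeCounters(&decodeParams)\n"
    "- Error code: 1\n"
    "- Source function: size_t QuadDDaemon::EventSource::GpuMetricsBackend::Impl::Collect("
    "QuadDDaemon::EventSource::GpuMetricsBackend::CounterDataImage*)\n"
    "- Source location: /build/agent/work/323cb361ab84164c/QuadD/Target/quadd_d/quadd_d/"
    "jni/EventSource/GpuMetricsBackend.cpp:1329"
)

# ErrorText copied from the second post (benchmarker run).
ERR_BENCHMARKER = (
    "GPU Metrics [0]: NVPA_STATUS_ERROR\n"
    "- API function: NVPW_Device_PeriodicSampler_DecodeCounters(&decodeParams)\n"
    "- Error code: 1\n"
    "- Source function: size_t QuadDDaemon::EventSource::GpuMetricsBackend::Impl::Collect("
    "QuadDDaemon::EventSource::GpuMetricsBackend::CounterDataImage*)\n"
    "- Source location: /build/agent/work/323cb361ab84164c/QuadD/Target/quadd_d/quadd_d/"
    "jni/EventSource/GpuMetricsBackend.cpp:1357"
)

if __name__ == "__main__":
    first = parse_error_text(ERR_DETECTION)
    second = parse_error_text(ERR_BENCHMARKER)
    # Print each field name and whether the two reports agree on it.
    for key, value in first.items():
        status = "same" if value == second.get(key) else "differs"
        print(f"{key:<16} {status}")
```

For these two messages only the source location line number (1329 vs 1357) differs; the failing call, NVPW_Device_PeriodicSampler_DecodeCounters, and error code 1 are the same in both runs.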


**** Errors occurred while processing the raw events. ****
**** Please see the Diagnostics Summary page after opening the report file in GUI. ****
[2/2] [========================100%] gpu1.sqlite
Generated:
    /rockshare/user/tniro/gpu_benchmarks/proxy/resnet/syseng5/gpu1/gpu1.qdstrm
    /rockshare/user/tniro/gpu_benchmarks/proxy/resnet/syseng5/gpu1/gpu1.nsys-rep
    /rockshare/user/tniro/gpu_benchmarks/proxy/resnet/syseng5/gpu1/gpu1.sqlite
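Since the importer reported non-fatal errors, it may help to check what the exported gpu1.sqlite actually contains. A minimal sketch using only Python's standard sqlite3 module lists the tables in the file; it opens the database read-only so a mistyped path fails instead of creating an empty file. Which table names nsys uses for GPU metrics depends on the nsys version and is not asserted here, and the helper name `list_tables` is hypothetical.

```python
import sqlite3
import sys
from pathlib import Path
from typing import List


def list_tables(db_path: str) -> List[str]:
    """Hypothetical helper: return all table names in an SQLite file, sorted."""
    # Read-only URI: raises sqlite3.OperationalError if the file does not exist.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Usage: python3 list_tables.py /path/to/gpu1.sqlite
    for table in list_tables(sys.argv[1]):
        print(table)
```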

Using container: nvcr.io/nvidia/pytorch:22.12-py3. The following is the nsys status:

root@4f49d25a37a0:/rockshare/user/tniro/benchmarker# nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: enabled
Linux Kernel Paranoid Level = 2
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-58-generic: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK

I’ve included the generated report:
gpu0.nsys-rep.gz (7.3 MB)

NOTE: we ran the same setup on an AMD server and had no issues.