Unable to capture: "Can't find UUID for CUDA device"

While using Nsight Systems to capture the same program repeatedly, I suddenly got the error below and am no longer able to profile.

RuntimeError (120) {
    RuntimeError (120) {
        OriginalExceptionClass: N5boost10wrapexceptIN11QuadDCommon16RuntimeExceptionEEE
        OriginalFile: /dvs/p4/build/sw/devtools/Agora/Rel/CUDA12.3/QuadD/Host/Analysis/Clients/AnalysisHelper/AnalysisStatus.cpp
        OriginalLine: 79
        OriginalFunction: static QuadDAnalysis::AnalysisHelper::AnalysisStatus::StatusInfo QuadDAnalysis::AnalysisHelper::AnalysisStatus::MakeFromErrorString(QuadDAnalysis::AnalysisHelper::AnalysisStatus::StatusType, QuadDAnalysis::AnalysisHelper::AnalysisStatus::ErrorType, const string&, const DevicePtr&)
        ErrorText: /dvs/p4/build/sw/devtools/Agora/Rel/CUDA12.3/QuadD/Target/quadd_d/quadd_d/jni/EventSource/Trace.cpp(1855): Throw in function QuadDCommon::Uuid QuadDDaemon::EventSource::Trace::GetCudaDeviceUuidForTimestamp(QuadDCommon::ProcessId, const ConstTraceEvent&, uint64_t) const
            Dynamic exception type: boost::wrapexcept<QuadDCommon::InternalErrorException>
            std::exception::what: InternalErrorException
            [QuadDCommon::tag_message*] = Can't find UUID for CUDA device 0 (PID 5163)
            
    }
}

I’m using “NVIDIA Nsight Systems, 2023.3.3.42-233333266658v0 Linux” on Ubuntu 20.04 with a 525.x series driver, an A6000 (device 0), and a 3090 (device 1).

I have tried the following:

  • rebooting as suggested
  • upgrading to the latest driver in the 525.x series

Nothing, to my knowledge, changed from one session to the other when this first started. Any help would be appreciated. Thanks.


I am facing the same issue after upgrading the GPU driver.

My first suggestion is to check whether you have any zombie Nsys processes on the system.

If there are none (or that doesn’t fix it), what are your Nsys and driver versions?
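
One way to run that check is sketched below. It is only an illustration: it assumes a Linux system with the procps `ps` command, and it assumes that the relevant processes have command names starting with `nsys` (for example `nsys` or `nsys-ui.bin`). The helper names are hypothetical and not part of Nsight Systems.

```python
import subprocess


def find_nsys_processes(ps_output):
    """Return (pid, state, command) for each process whose name starts with 'nsys'.

    Expects the text produced by `ps -eo pid,stat,comm`.
    The 'nsys' prefix is an assumption about how the processes are named.
    """
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, state, rest of the line
        if len(parts) < 3:
            continue
        pid, state, command = parts
        command = command.strip()
        if command.startswith("nsys"):
            matches.append((int(pid), state, command))
    return matches


def list_nsys_processes():
    """Run ps and filter its output (hypothetical convenience wrapper)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,stat,comm"],
        capture_output=True,
        text=True,
        check=True,
    )
    return find_nsys_processes(result.stdout)


def is_zombie(state):
    """A process state beginning with 'Z' marks a zombie (defunct) process."""
    return state.startswith("Z")
```

Calling `list_nsys_processes()` and filtering the result with `is_zombie` shows which entries, if any, are defunct. An empty list means no process matching the assumed name prefix is present.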

The same issue occurs after system reboot and I don’t see any nsys processes running.

NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0

$ nsys --version

NVIDIA Nsight Systems version 2023.3.3.42-233333266658v0

$ nsys status --environment

Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 1
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-84-generic: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail

@skottapalli can you please look into this?

  1. What is the full output of nvidia-smi command?
  2. Are you able to profile a simple CUDA toolkit sample like matrixMul with the same combination of the driver version and nsys version?
  3. What is the full nsys command line you are using to profile your app?

The 2023.3.3 nsys version is from the CUDA toolkit (CTK) version 12.3. That nsys version supports CUDA tracing fully when the driver version is <= 525.60.13. It looks like you have a slightly newer driver, and that may be causing some issues. Could you try the nsys version from the web release (Nsight Systems - Get Started | NVIDIA Developer)? It is newer than the nsys found in CTK 12.3.
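
The driver comparison described in this reply can be sketched as follows. This is a rough illustration rather than an official compatibility check: the `525.60.13` figure is simply the one quoted in this reply, and the function names are hypothetical.

```python
import re


def parse_driver_version(smi_text):
    """Extract the driver version from nvidia-smi header text as a tuple of ints."""
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", smi_text)
    if match is None:
        raise ValueError("no 'Driver Version:' field found")
    return tuple(int(part) for part in match.group(1).split("."))


def is_newer_than(driver, limit):
    """Compare versions component by component using tuple ordering."""
    return driver > limit


# Limit as quoted in the reply above (an assumption, not a verified support matrix).
STATED_LIMIT = (525, 60, 13)

header = "NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0"
driver = parse_driver_version(header)
print(driver, "newer than stated limit:", is_newer_than(driver, STATED_LIMIT))
```

Comparing integer tuples avoids the pitfall of comparing the strings directly, where "525.147.05" would sort before "525.60.13".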

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:67:00.0 Off |                  Off |
| 30%   41C    P8    20W / 300W |     13MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:68:00.0  On |                  N/A |
| 35%   45C    P8    35W / 350W |   1607MiB / 24576MiB |     35%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1856      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2510      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1856      G   /usr/lib/xorg/Xorg                375MiB |
|    1   N/A  N/A      2510      G   /usr/lib/xorg/Xorg                626MiB |
|    1   N/A  N/A      3036      G   /usr/bin/gnome-shell              171MiB |
|    1   N/A  N/A      3794      G   ...veSuggestionsOnlyOnDemand      111MiB |
|    1   N/A  N/A      5173      G   ...RendererForSitePerProcess       14MiB |
|    1   N/A  N/A      5583      G   ...ost-linux-x64/nsys-ui.bin      102MiB |
|    1   N/A  N/A      7841      G   gnome-control-center                4MiB |
|    1   N/A  N/A      8213      G   /usr/lib/firefox/firefox          177MiB |
+-----------------------------------------------------------------------------+
  1. I was able to profile matrixMul using the Nsight UI and default project settings.

  2. I am using the Nsight UI to configure and launch the session. I created a new project and used the defaults, changing only the command line and working directory.

DeviceId: "Local"
EventTypes {
  Items: CpuCycles
  Items: OSRuntime
  Items: Cuda
  Items: NvtxEvents
}
HowToStart: Immediate
HowToStop: Manual
DeviceType: Unix
DeviceDisplayName: "REPLACED"
OSRuntimeOptions {
  DurationThresholdNs: 1000
}
Processes {
  HowToAttach: LaunchAnother
  Command: "/home/REPLACED/repo/REPLACED/venv/bin/python"
  Arguments: "-m"
  Arguments: "REPLACED"
  WorkingDirectory: "/home/REPLACED/repo/REPLACED"
  UserName: "REPLACED"
  CollectNvtxTrace: true
  CollectCudaTrace: true
  EnvironmentVariables {
    Name: "DESKTOP_SESSION"
    Value: "ubuntu"
  }
  EnvironmentVariables {
    Name: "DISPLAY"
    Value: ":1"
  }
  EnvironmentVariables {
    Name: "GDMSESSION"
    Value: "ubuntu"
  }
  EnvironmentVariables {
    Name: "GIO_LAUNCHED_DESKTOP_FILE"
    Value: "/usr/share/applications/nsys-ui-2023.3.3.desktop"
  }
  EnvironmentVariables {
    Name: "GIO_LAUNCHED_DESKTOP_FILE_PID"
    Value: "5539"
  }
  EnvironmentVariables {
    Name: "GJS_DEBUG_OUTPUT"
    Value: "stderr"
  }
  EnvironmentVariables {
    Name: "GJS_DEBUG_TOPICS"
    Value: "JS ERROR;JS LOG"
  }
  EnvironmentVariables {
    Name: "GNOME_DESKTOP_SESSION_ID"
    Value: "this-is-deprecated"
  }
  EnvironmentVariables {
    Name: "GNOME_SHELL_SESSION_MODE"
    Value: "ubuntu"
  }
  EnvironmentVariables {
    Name: "GPG_AGENT_INFO"
    Value: "/run/user/1001/gnupg/S.gpg-agent:0:1"
  }
  EnvironmentVariables {
    Name: "GTK_MODULES"
    Value: "gail:atk-bridge"
  }
  EnvironmentVariables {
    Name: "HOME"
    Value: "/home/REPLACED"
  }
  EnvironmentVariables {
    Name: "IM_CONFIG_PHASE"
    Value: "1"
  }
  EnvironmentVariables {
    Name: "INVOCATION_ID"
    Value: "7ab26db5febb44d6986554585c198d04"
  }
  EnvironmentVariables {
    Name: "JOURNAL_STREAM"
    Value: "8:66666"
  }
  EnvironmentVariables {
    Name: "LD_LIBRARY_PATH"
    Value: ""
  }
  EnvironmentVariables {
    Name: "LD_PRELOAD"
    Value: ":{}"
  }
  EnvironmentVariables {
    Name: "LOGNAME"
    Value: "REPLACED"
  }
  EnvironmentVariables {
    Name: "MANAGERPID"
    Value: "2392"
  }
  EnvironmentVariables {
    Name: "NSYS_LD_LIBRARY_PATH"
    Value: ""
  }
  EnvironmentVariables {
    Name: "NSYS_QT_PLUGIN_PATH"
    Value: ""
  }
  EnvironmentVariables {
    Name: "NV_AGORA_CRASH_FD"
    Value: "7"
  }
  EnvironmentVariables {
    Name: "NV_AGORA_PATH"
    Value: "/opt/nvidia/nsight-systems/2023.3.3/host-linux-x64"
  }
  EnvironmentVariables {
    Name: "NV_QUADD_PATH"
    Value: "/opt/nvidia/nsight-systems/2023.3.3/host-linux-x64/nsys-ui.bin"
  }
  EnvironmentVariables {
    Name: "PATH"
    Value: "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda-11/bin"
  }
  EnvironmentVariables {
    Name: "PWD"
    Value: "/home/REPLACED"
  }
  EnvironmentVariables {
    Name: "QT_ACCESSIBILITY"
    Value: "1"
  }
  EnvironmentVariables {
    Name: "QT_IM_MODULE"
    Value: "ibus"
  }
  EnvironmentVariables {
    Name: "QT_PLUGIN_PATH"
    Value: ""
  }
  EnvironmentVariables {
    Name: "QUADD_INSTALL_DIR"
    Value: "/opt/nvidia/nsight-systems/2023.3.3/target-linux-x64"
  }
  EnvironmentVariables {
    Name: "SESSION_MANAGER"
    Value: "local/REPLACED:@/tmp/.ICE-unix/3021,unix/REPLACED:/tmp/.ICE-unix/3021"
  }
  EnvironmentVariables {
    Name: "SHELL"
    Value: "/bin/bash"
  }
  EnvironmentVariables {
    Name: "SHLVL"
    Value: "1"
  }
  EnvironmentVariables {
    Name: "SSH_AGENT_PID"
    Value: "2646"
  }
  EnvironmentVariables {
    Name: "SSH_AUTH_SOCK"
    Value: "/run/user/1001/keyring/ssh"
  }
  EnvironmentVariables {
    Name: "USER"
    Value: "REPLACED"
  }
  EnvironmentVariables {
    Name: "USERNAME"
    Value: "REPLACED"
  }
  EnvironmentVariables {
    Name: "WINDOWPATH"
    Value: "2"
  }
  EnvironmentVariables {
    Name: "XAUTHORITY"
    Value: "/run/user/1001/gdm/Xauthority"
  }
  EnvironmentVariables {
    Name: "XDG_CONFIG_DIRS"
    Value: "/etc/xdg/xdg-ubuntu:/etc/xdg"
  }
  EnvironmentVariables {
    Name: "XDG_CURRENT_DESKTOP"
    Value: "ubuntu:GNOME"
  }
  EnvironmentVariables {
    Name: "XDG_DATA_DIRS"
    Value: "/usr/share/ubuntu:/usr/local/share/:/usr/share/:/var/lib/snapd/desktop"
  }
  EnvironmentVariables {
    Name: "XDG_MENU_PREFIX"
    Value: "gnome-"
  }
  EnvironmentVariables {
    Name: "XDG_RUNTIME_DIR"
    Value: "/run/user/1001"
  }
  EnvironmentVariables {
    Name: "XDG_SESSION_CLASS"
    Value: "user"
  }
  EnvironmentVariables {
    Name: "XDG_SESSION_DESKTOP"
    Value: "ubuntu"
  }
  EnvironmentVariables {
    Name: "XDG_SESSION_TYPE"
    Value: "x11"
  }
  EnvironmentVariables {
    Name: "_"
    Value: "/opt/nvidia/nsight-systems/2023.3.3/host-linux-x64/CrashReporter"
  }
  CudaFlushPeriodically: true
  CudaFlushPeriod: 10000000000
  CudaSkipSomeApiCalls: true
  CollectGPUMemoryUsage: false
  CudaGraphTraceOptions {
    Mode: Graph
  }
  CudaFlushOnCudaProfilerStop: true
}
ShowBacktrace: true
UseDWARF: true
IncludeChildren: true
UseLinuxPerf: true
LinuxPerfOptions {
  CollectCpuCtxswTrace: true
  SamplingPeriod: 1000000
  CollectIPBacktraceSamples: true
  SamplesPerBacktrace: 4
  Mode: ProcessTree
  cpuIPSamplingTriggerEventIndex: 0
  triggerType: Software
}
SymbolResolutionOptions {
  ResolveSymbols: true
}
NetworkProfilingOptions {
  ShouldCollectNicMetrics: false
}

I also tried profiling with the following command line and had the same problem:

nsys profile --trace=cuda,nvtx,cudnn venv/bin/python -m MY_MODULE

Will try the newer Nsight Release and update with results.

I was only able to find 2023.3.1 despite the download page mentioning 2023.3.3.

$ nsys --version

NVIDIA Nsight Systems version 2023.3.1.92-233133147223v0

The good news is that both nsys profile and the Nsight Systems UI successfully capture my program.


Glad it is working. The web release is version 2023.3.1, which has newer code than the 2023.3.3 version from CTK 12.3. Unfortunately, the versioning scheme does not really reflect which is newer. Thanks for confirming that it works for you.
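
To illustrate why the version number alone is misleading here, the sketch below parses the two `nsys --version` strings from this thread and compares them numerically. The parsing helper is hypothetical, and the regular expression only assumes the output format shown in this thread.

```python
import re


def parse_nsys_version(text):
    """Split an `nsys --version` line into (numeric release tuple, build string)."""
    match = re.search(r"version (\d+)\.(\d+)\.(\d+)\.(\d+)-(\S+)", text)
    if match is None:
        raise ValueError("unrecognized nsys version string")
    release = tuple(int(group) for group in match.groups()[:4])
    return release, match.group(5)


ctk_release, ctk_build = parse_nsys_version(
    "NVIDIA Nsight Systems version 2023.3.3.42-233333266658v0"
)
web_release, web_build = parse_nsys_version(
    "NVIDIA Nsight Systems version 2023.3.1.92-233133147223v0"
)

# Numerically the CTK build looks newer, yet per this thread the web
# release (2023.3.1) contains the newer code. The number is not a
# reliable ordering on its own.
print("CTK release sorts higher:", ctk_release > web_release)
```

A script that picks "the latest" nsys by comparing these numbers would therefore choose the CTK build, which is the one that failed in this thread.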


For my purposes, this issue is resolved. Thanks.