[QuadDCommon::tag_message*] = No GPU associated to the given UUID

I am trying to use nsys start and nsys stop to profile an application, but during nsys stop I am getting the following error.
Can anyone tell me what needs to be done?

FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/GpuTraits/Src/GpuTicksConverter.cpp(371): Throw in function QuadDCommon::TimestampType GpuTraits::GpuTicksConverter::ConvertToCpuTime(const QuadDCommon::Uuid&, uint64_t&) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::NotFoundException>
std::exception::what: NotFoundException
[QuadDCommon::tag_message*] = No GPU associated to the given UUID

Nsight version
NVIDIA Nsight Systems version 2024.1.1.59-241133802077v0

Here is the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:41:00.0 Off |                   On |
| N/A   51C    P0   103W / 300W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    2   0   0  |  39023MiB / 40192MiB | 56      0 |  4   0    2    0    0 |
|                  |      3MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Greetings @jaiwant23,

Sorry to hear you are having this issue. Can you see if you can reproduce the error with the most recent version (2024.4) of the software?

Also, what command line arguments are you using to perform your profiling?

Hi @mhallock,
I tried the latest version as well, but the same error occurs there too, and the nsys session is also getting killed after that error.
This is the version that I am using now:

nsys --version
NVIDIA Nsight Systems version 2024.4.1.61-244134315967v0

Regarding the command line arguments: I am starting the application with the nsys launch command with trace=cuda. For profiling I am using the nsys start and nsys stop commands, roughly as sketched below.
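A minimal sketch of that sequence (./my_app is just a placeholder for the real application launch):

nsys launch --trace=cuda ./my_app
# in another shell, around the region of interest:
nsys start
nsys stop
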
Error

FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/GpuTraits/Src/GpuTicksConverter.cpp(371): Throw in function QuadDCommon::TimestampType GpuTraits::GpuTicksConverter::ConvertToCpuTime(const QuadDCommon::Uuid&, uint64_t&) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::NotFoundException>
std::exception::what: NotFoundException
[QuadDCommon::tag_message*] = No GPU associated to the given UUID

Is there any way to debug or fix this?

Thanks

Hi @mhallock, anything on the above? It seems the issue is happening with MIG-based GPU profiling. Is there a way we can do Nsight Systems profiling on that?

Hello,

Thanks for the additional information - It does sound like MIG is causing the issue here. Can you describe your execution environment a little more? Is your app being run in a container? Is the container running in “privileged” mode?

Hi @mhallock,
It is running in a Kubernetes pod (container), with the same GPU details as the nvidia-smi output in my first post (A100 80GB with MIG enabled).
It is not running in privileged mode. The same setup works with a full GPU.
Does NVIDIA Nsight Systems support MIG or not?

Thank you for the additional information.

Yes, MIG should be supported, at least for cuda trace. There are a few options relating to video APIs that are not. Just to understand your environment a little more so we can try and narrow in on the problem, can you tell me what hardware platform you are running on (physical x86_64 server, Jetson-based system, or cloud instance)?

Could you also provide the output of:

nsys status -e

Thank you for your patience.

Hi @mhallock ,

It is a physical x86_64 server.

nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 1
Linux Distribution = CentOS
Linux Kernel Version = 5.15.138.1-4.cm2: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.
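For reference, the paranoid level can be checked and lowered on the host if system-wide CPU sampling is ever needed (a rough sketch; lower values are more permissive, and this is unrelated to the GPU UUID error above):

cat /proc/sys/kernel/perf_event_paranoid
sudo sysctl -w kernel.perf_event_paranoid=0   # persist via a file under /etc/sysctl.d/ if needed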

Thank you again for the info. I will try to replicate this internally.

@jaiwant23 I’ve been able to reproduce it and figure out what is going on.

There is an issue between the driver version installed and the cuda profiling library used by nsys, that seems to be triggered only when MIG is enabled. The issue is not actually with k8s; I think you should hit the same error on the host as well if you try as a non-root user.

I’ve found three possible ways forward for you. You can pick any one of these and end up with a working configuration (a rough sketch for the first two is included after the list):

  • Update the host to CUDA 12.2 w/ driver version 535.183.01, or newer
  • Downgrade to an older version of nsys (2023.1.2 is the newest that I believe will function)
  • Run the pod as privileged (it needs CAP_SYS_ADMIN to work correctly with this driver version)
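
For the first two options, something along these lines on the host (a rough sketch; package names assume the NVIDIA devtools/CUDA apt repository is configured and may differ in your setup):

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # compare against 535.183.01
apt list --installed 2>/dev/null | grep nsight-systems        # check which nsys packages are present
sudo apt-get install nsight-systems-2023.1.2                  # example of pinning an older release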

Thanks, I will try these and update if I still see the issue with the above.

Hi @mhallock ,
I tried the first two solutions:

  1. Using CUDA 12.2: there are incompatibility issues that we are observing when using the CUDA 12 version with TF Serving:
    external/org_tensorflow/tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
    2024-07-05 13:44:07.595307: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory

  2. Downgrading the nsys version: I downgraded to the following one: nsys --version NVIDIA Nsight Systems version 2022.1.3.3-1c7b5f7
    But the same issue is still occurring.

  3. For running the pod in privileged mode: can it instead be done by setting options nvidia NVreg_RestrictProfilingToAdminUsers=0 in /etc/modprobe.d?
    I was checking this issue, NVIDIA Development Tools Solutions - | NVIDIA Developer, since CAP_SYS_ADMIN is a very highly privileged permission.

Hi @jaiwant23,

Sorry to hear you are still encountering issues.

  1. It looks like you’d have to upgrade TensorFlow to use the newer CUDA versions. While that is a good thing in the long run, it is probably not what you want to do right now.

  2. With driver version 525 that you were initially using, I tried four different nsys versions, but I did not try as far back as 2022.1. I tried (installed via apt):

  • 2022.4.2: Worked
  • 2023.1.2: Worked
  • 2023.3.3: Failed
  • 2023.2.3: Failed
  3. I will need to investigate this option. Edit: setting NVreg_RestrictProfilingToAdminUsers=0 should be sufficient.

This does appear to work. Be mindful you have to do it on the host OS, not in the container.

Hi @mhallock ,
For options nvidia NVreg_RestrictProfilingToAdminUsers=0, I tried this option as well, with a file like this:

/etc/modprobe.d ]$ cat  nvidia.conf
options nvidia NVreg_RestrictProfilingToAdminUsers=0

The node/host was also rebooted, but the same issue persists. I did check the ncu command, which was failing earlier due to the permission issue (NVIDIA Development Tools Solutions - | NVIDIA Developer); it is now working, so the flag is taking effect, but nsys profiling for MIG still somehow has the issue.
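
For reference, the flag can also be verified against the loaded driver module (a rough sketch; RmProfilingAdminOnly is, as I understand it, the in-kernel name of that option):

grep RmProfilingAdminOnly /proc/driver/nvidia/params   # 0 means profiling is not restricted to admin users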

FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/GpuTraits/Src/GpuTicksConverter.cpp(371): Throw in function QuadDCommon::TimestampType GpuTraits::GpuTicksConverter::ConvertToCpuTime(const QuadDCommon::Uuid&, uint64_t&) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::NotFoundException>
std::exception::what: NotFoundException
[QuadDCommon::tag_message*] = No GPU associated to the given UUID

Hi @mhallock, were you able to check this?

Greetings,

I’m sorry that didn’t work for you. My test setup is clearly not recreating your true issue properly.

We are working on a fix for the current nsys version and MIG problems. I’d still like to better understand exactly what you are encountering so that we can ensure it is addressed, since all of my mitigations so far have been unsuccessful.

Can you confirm/try a few things for me?

  • That you’ve tried nsys version 2023.1.2, that one worked “out of the box” for me on CUDA 12.0.
  • That your command line is just nsys launch -t cuda <application>, and is not trying to capture any other trace types or options?
  • Can you try profiling some simple cuda samples? I have been using vectorAdd and UnifiedMemoryPerf in my testing. I’m curious if your actual application is triggering something that my tests are not.
  • Can you test that both directly on the host and in a container to see if there is a difference in behavior?
  • Lastly, can you try collecting just GPU metrics? Use nsys profile --gpu-metrics-device=all --duration=5, as I think that should exercise the module option that you’ve set. I’m just curious. (Example invocations for these are sketched below.)
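
For reference, something along these lines (a sketch; the vectorAdd path assumes a checkout of the github.com/NVIDIA/cuda-samples repository):

cd cuda-samples/Samples/0_Introduction/vectorAdd && make
nsys profile -t cuda ./vectorAdd                                 # cuda trace only
nsys profile --gpu-metrics-device=all --duration=5 ./vectorAdd   # GPU metrics only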

Thank you!

Hi

  1. I tried downgrading to nsys version 2022.4.2.1; with this I was able to profile the application on MIG. The issue with the downgrade is that it lacks the new features added in newer versions of nsys.
  2. Yes, I am using nsys launch -t cuda <application> with the cuda trace only.
    For the other points listed, I will check whether I can get the setup working for them.

Hi @mhallock ,

1st
I tried MIG profiling using the simple code below, and I am seeing the same issue with that as well.
Here is the test sample:

#include <cstdio>
#include <cuda_runtime.h>  // CUDA runtime API (nvcc includes this implicitly for .cu files)

__global__ void helloFromGPU() {
    printf("Hello World from GPU!\n");
}

int main() {
    helloFromGPU<<<1, 10>>>();
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }
    fflush(stdout);  // Flush the output after device code execution
    return 0;
}
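
For reference, it was built with something like this (assuming the file is saved as testsample.cu):

nvcc -o testsample testsample.cu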
  • It was run using the following: nsys profile -t cuda ./testsample
  • It only works with nsys 2022.4.2.1
  • It also does not work with options nvidia NVreg_RestrictProfilingToAdminUsers=0 set.
  • I ran the above code inside a K8s pod.

2nd
For nsys profile --gpu-metrics-device=all --duration=5 on the above example, I am getting this output on the MIG GPU:

nsys profile --gpu-metrics-device=all --duration=5  ./example
Illegal --gpu-metrics-device arguments.
None of the installed GPUs are supported. See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gms-introduction

usage: nsys profile [<args>] [application] [<application args>]
Try 'nsys profile --help' for more information.

For the above, the nsys version was: NVIDIA Nsight Systems version 2024.4.1.61-244134315967v0

@jaiwant23 Thank you so much for testing that out and the additional information! I will be in touch soon.