[QuadDCommon::tag_message*] = No GPU associated to the given UUID

I am trying to use nsys start and nsys stop to profile an application, but during nsys stop I am getting the following error.
Can anyone tell me what needs to be done?

FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/GpuTraits/Src/GpuTicksConverter.cpp(371): Throw in function QuadDCommon::TimestampType GpuTraits::GpuTicksConverter::ConvertToCpuTime(const QuadDCommon::Uuid&, uint64_t&) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::NotFoundException>
std::exception::what: NotFoundException
[QuadDCommon::tag_message*] = No GPU associated to the given UUID

Nsight version
NVIDIA Nsight Systems version 2024.1.1.59-241133802077v0

Here is the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:41:00.0 Off |                   On |
| N/A   51C    P0   103W / 300W |                  N/A |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    2   0   0  |  39023MiB / 40192MiB | 56      0 |  4   0    2    0    0 |
|                  |      3MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Greetings @jaiwant23,

Sorry to hear you are having this issue. Can you see if you can reproduce the error with the most recent version (2024.4) of the software?

Also, what command line arguments are you using to perform your profiling?

Hi @mhallock,
I tried the latest version as well, but the same error occurs there too, and the nsys session is also getting killed after that error.
This is the version that I am using now:

nsys --version
NVIDIA Nsight Systems version 2024.4.1.61-244134315967v0

Regarding the command line arguments: I am starting the application with the nsys launch command with trace=cuda. For profiling I am using the nsys start and nsys stop commands, roughly as sketched below.
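A minimal sketch of that sequence (./my_app is just a placeholder for the real application launch):

nsys launch --trace=cuda ./my_app
# in another shell, around the region of interest:
nsys start
nsys stop
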
Error

FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/GpuTraits/Src/GpuTicksConverter.cpp(371): Throw in function QuadDCommon::TimestampType GpuTraits::GpuTicksConverter::ConvertToCpuTime(const QuadDCommon::Uuid&, uint64_t&) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::NotFoundException>
std::exception::what: NotFoundException
[QuadDCommon::tag_message*] = No GPU associated to the given UUID

Is there any way to debug or fix this?

Thanks

Hi @mhallock, anything on the above? It seems the issue is happening with MIG-based GPU profiling. Is there a way we can do Nsight Systems profiling on that?

Hello,

Thanks for the additional information - It does sound like MIG is causing the issue here. Can you describe your execution environment a little more? Is your app being run in a container? Is the container running in “privileged” mode?

Hi @mhallock,
It is running in a Kubernetes pod (container), with the same GPU details as the nvidia-smi output in my first post (A100 80GB with MIG enabled).
It is not running in privileged mode. The same setup works with a full GPU.
Does NVIDIA Nsight Systems support MIG or not?

Thank you for the additional information.

Yes, MIG should be supported, at least for cuda trace. There are a few options relating to video APIs that are not. Just to understand your environment a little more so we can try and narrow in on the problem, can you tell me what hardware platform you are running on (physical x86_64 server, Jetson-based system, or cloud instance)?

Could you also provide the output of:

nsys status -e

Thank you for your patience.

Hi @mhallock ,

It is a physical x86_64 server.

nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 1
Linux Distribution = CentOS
Linux Kernel Version = 5.15.138.1-4.cm2: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.
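For reference, the paranoid level can be checked and lowered on the host if system-wide CPU sampling is ever needed (a rough sketch; lower values are more permissive, and this is unrelated to the GPU UUID error above):

cat /proc/sys/kernel/perf_event_paranoid
sudo sysctl -w kernel.perf_event_paranoid=0   # persist via a file under /etc/sysctl.d/ if needed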

Thank you again for the info. I will try to replicate this internally.

@jaiwant23 I’ve been able to reproduce it and figure out what is going on.

There is an issue between the driver version installed and the cuda profiling library used by nsys, that seems to be triggered only when MIG is enabled. The issue is not actually with k8s; I think you should hit the same error on the host as well if you try as a non-root user.

I’ve found three possible ways forward for you. You can pick any one of these and end up with a working configuration (a rough sketch for the first two is included after the list):

  • Update the host to CUDA 12.2 w/ driver version 535.183.01, or newer
  • Downgrade to an older version of nsys (2023.1.2 is the newest that I believe will function)
  • Run the pod as privileged (it needs CAP_SYS_ADMIN to work correctly with this driver version)
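
For the first two options, something along these lines on the host (a rough sketch; package names assume the NVIDIA devtools/CUDA apt repository is configured and may differ in your setup):

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # compare against 535.183.01
apt list --installed 2>/dev/null | grep nsight-systems        # check which nsys packages are present
sudo apt-get install nsight-systems-2023.1.2                  # example of pinning an older release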

Thanks, I will try these and update if I still see the issue with the above.

Hi @mhallock ,
I tried the first two solutions:

  1. Using CUDA 12.2: there are incompatibility issues that we are observing when using the CUDA 12 version with TF Serving:
    external/org_tensorflow/tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
    2024-07-05 13:44:07.595307: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory

  2. Downgrading the nsys version: I downgraded to the following one: nsys --version NVIDIA Nsight Systems version 2022.1.3.3-1c7b5f7
    But the same issue is still occurring.

  3. For running the pod in privileged mode: can it instead be done by setting options nvidia NVreg_RestrictProfilingToAdminUsers=0 in /etc/modprobe.d?
    I was checking this issue, NVIDIA Development Tools Solutions - | NVIDIA Developer, since CAP_SYS_ADMIN is a very highly privileged permission.

Hi @jaiwant23,

Sorry to hear you are still encountering issues.

  1. It looks like you’d have to upgrade TensorFlow to use the newer CUDA versions. While that is a good thing in the long run, it is probably not what you want to do right now.

  2. With driver version 525 that you were initially using, I tried four different nsys versions, but I did not try as far back as 2022.1. I tried (installed via apt):

  • 2022.4.2: Worked
  • 2023.1.2: Worked
  • 2023.3.3: Failed
  • 2023.2.3: Failed
  3. I will need to investigate this option. Edit: setting NVreg_RestrictProfilingToAdminUsers=0 should be sufficient.

This does appear to work. Be mindful you have to do it on the host OS, not in the container.

Hi @mhallock ,
For options nvidia NVreg_RestrictProfilingToAdminUsers=0, I tried this option as well, with a file like this:

/etc/modprobe.d ]$ cat  nvidia.conf
options nvidia NVreg_RestrictProfilingToAdminUsers=0

The node/host was also rebooted, but the same issue persists. I did check the ncu command, which was failing earlier due to the permission issue (NVIDIA Development Tools Solutions - | NVIDIA Developer); it is now working, so the flag is taking effect, but nsys profiling for MIG still somehow has the issue.
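
For reference, the flag can also be verified against the loaded driver module (a rough sketch; RmProfilingAdminOnly is, as I understand it, the in-kernel name of that option):

grep RmProfilingAdminOnly /proc/driver/nvidia/params   # 0 means profiling is not restricted to admin users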

FATAL ERROR: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/GpuTraits/Src/GpuTicksConverter.cpp(371): Throw in function QuadDCommon::TimestampType GpuTraits::GpuTicksConverter::ConvertToCpuTime(const QuadDCommon::Uuid&, uint64_t&) const
Dynamic exception type: boost::wrapexcept<QuadDCommon::NotFoundException>
std::exception::what: NotFoundException
[QuadDCommon::tag_message*] = No GPU associated to the given UUID

Hi @mhallock, were you able to check this?

Greetings,

I’m sorry that didn’t work for you. My test setup is clearly not recreating your true issue properly.

We are working on a fix for the current nsys version and MIG problems. I’d still like to better understand exactly what you are encountering so that we can ensure it is addressed, since all of my mitigations so far have been unsuccessful.

Can you confirm/try a few things for me?

  • That you’ve tried nsys version 2023.1.2, that one worked “out of the box” for me on CUDA 12.0.
  • That your command line is just nsys launch -t cuda <application>, and is not trying to capture any other trace types or options?
  • Can you try profiling some simple cuda samples? I have been using vectorAdd and UnifiedMemoryPerf in my testing. I’m curious if your actual application is triggering something that my tests are not.
  • Can you test that both directly on the host and in a container to see if there is a difference in behavior?
  • Lastly, can you try collecting just GPU metrics? Use nsys profile --gpu-metrics-device=all --duration=5, as I think that should exercise the module option that you’ve set. I’m just curious. (Example invocations for these are sketched below.)
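
For reference, something along these lines (a sketch; the vectorAdd path assumes a checkout of the github.com/NVIDIA/cuda-samples repository):

cd cuda-samples/Samples/0_Introduction/vectorAdd && make
nsys profile -t cuda ./vectorAdd                                 # cuda trace only
nsys profile --gpu-metrics-device=all --duration=5 ./vectorAdd   # GPU metrics only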

Thank you!

Hi

  1. I tried downgrading to nsys version 2022.4.2.1; with this I was able to profile the application on MIG. The issue with the downgrade is that it lacks the new features added in newer versions of nsys.
  2. Yes, I am using nsys launch -t cuda <application> with the cuda trace only.
    For the other points listed, I will check whether I can get the setup working for them.

Hi @mhallock ,

1st
I tried MIG profiling using the simple code below, and I am seeing the same issue with that as well.
Here is the test sample:

#include <cstdio>
#include <cuda_runtime.h>  // CUDA runtime API (nvcc includes this implicitly for .cu files)

__global__ void helloFromGPU() {
    printf("Hello World from GPU!\n");
}

int main() {
    helloFromGPU<<<1, 10>>>();
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    }
    fflush(stdout);  // Flush the output after device code execution
    return 0;
}
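
For reference, it was built with something like this (assuming the file is saved as testsample.cu):

nvcc -o testsample testsample.cu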
  • It was run using the following: nsys profile -t cuda ./testsample
  • It only works with nsys 2022.4.2.1
  • It also does not work with options nvidia NVreg_RestrictProfilingToAdminUsers=0 set.
  • I ran the above code inside a K8s pod.

2nd
For nsys profile --gpu-metrics-device=all --duration=5 on the above example, I am getting this output on the MIG GPU:

nsys profile --gpu-metrics-device=all --duration=5  ./example
Illegal --gpu-metrics-device arguments.
None of the installed GPUs are supported. See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gms-introduction

usage: nsys profile [<args>] [application] [<application args>]
Try 'nsys profile --help' for more information.

For the above, the nsys version was: NVIDIA Nsight Systems version 2024.4.1.61-244134315967v0

@jaiwant23 Thank you so much for testing that out and the additional information! I will be in touch soon.