Bug: NVML incorrectly detects certain GPUs as unsupported.

CFSworks · July 20, 2013, 12:54am

Hey all!

It seems NVML has a bug where it incorrectly returns NVML_ERROR_NOT_SUPPORTED on certain calls for certain GPUs.

For example, I have a GeForce GTX TITAN which supports reporting utilization, yet the nvmlDeviceGetUtilizationRates call still returns NVML_ERROR_NOT_SUPPORTED incorrectly. I wrote a C program demonstrating both this behavior and a way to work around the bug here: http://cfsworks.com/files/downloads/nvml_bug.c

The first time through, it returns NVML_ERROR_NOT_SUPPORTED for both calls, but the calls are clearly supported… Upon applying the workaround, the correct information comes back!

Even better, when I used GDB to call the workaround from within nvidia-smi, it correctly reported everything except ECC information (though this is fine as I’m fairly sure my TITAN is not equipped with ECC RAM).

Since the workaround isn’t terribly complex and doesn’t seem to have any side-effects, I’d imagine this to be a fairly simple bug to fix upstream.

kokoko3k · July 20, 2013, 10:50am

@CFSworks:
Good work, I can’t wait to try it.
And since NVIDIA never admitted that this “bug” was a marketing choice,
it will be fun to see how they’ll react to your code :)

-EDIT
Where is nvml.h? I tried the cuda 5.0.35 package on Arch Linux, but it seems to be missing.
-EDIT
Found it: https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/NVML/tdk_3.304.5.tar.gz
But when I try to compile your code, I get a lot of undefined references, like here:
Using NVML in C program - CUDA Programming and Performance - NVIDIA Developer Forums

Could you help please?

CFSworks · July 20, 2013, 11:31am

@kokoko3k:
Yeah, unless they state that this behavior is intentional, I’m going to operate under the assumption that it isn’t.
And even if this is intentional, it’s still a bug because it leads to broken functionality. If it was a marketing choice, it was a bug in the marketers rather than the code. ;)

Also, I compiled with:
gcc -onvml_bug nvml_bug.c -lnvidia-ml

See if that works and let me know. :D

kokoko3k · July 20, 2013, 1:01pm

It is working really fine:

Found 1 device(s):
        Device 0, "GeForce 9800 GT":
                ---- WITHOUT BUGFIX ----
                Utilization: Not Supported
                Power usage: Not Supported
                ---- WITH BUGFIX ----
                Utilization: 2% GPU, 2% MEM
                Power usage: Not Supported

Thank you :)

-EDIT

I was wondering why nvidia-smi was closed, but now…

CFSworks · July 21, 2013, 8:02am

Just created an even better workaround for this; it’s here:
https://github.com/CFSworks/nvml_fix

This makes nvidia-smi work again!

calgaryresident · July 25, 2013, 5:27pm

Any chance of getting a fix for driver version 304.22?

pszilard · November 12, 2013, 2:20pm

Bump! Does anybody have an idea on how to make it work with the new 331 drivers, or how to fix (or at least explain the reason behind) the reported issues with various 319 and 325 drivers (https://github.com/CFSworks/nvml_fix/issues/3)?

pszilard · November 12, 2013, 2:42pm

PS: and just to reiterate what the OP said, it is absolutely ridiculous that metrics which are included in the graphical tool nvidia-settings are deliberately blocked in nvidia-smi and NVML solely for marketing purposes!

This is simply wrong! I would urge the developer/user community to voice their opinion on this - both here as well as in the form of an official bug report!

pszilard · January 7, 2014, 7:32pm

FYI: a colleague of mine has figured out the (probably most frequent) issue I and others have encountered. To sum it up briefly, gcc >=4.5 uses the --add-needed flag which causes incorrect linking and leads to the usual “Mismatch in versions between nvidia-smi and NVML…” error.

For more details see: New driver version 325.15 · Issue #3 · CFSworks/nvml_fix · GitHub (https://github.com/CFSworks/nvml_fix/issues/3)

nunni · January 30, 2014, 6:33pm

I’m using gcc version 4.4.7
In any case, I just modified the original nvml_bug.c as follows:

diff nvml_bug.c~ nvml_bug.c
17a18
>       !strcmp(version, "319.76") ||

then compiled it with
gcc -o nvml_bug nvml_bug.c -l nvidia-ml

and can successfully run it like this:
./nvml_bug
Found 1 device(s):
        Device 0, "GeForce GTX 780 Ti":
                ---- WITHOUT BUGFIX ----
                Utilization: Not Supported
                Power usage: Not Supported
                ---- WITH BUGFIX ----
                Utilization: 59% GPU, 3% MEM
                Power usage: 173W
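
Editor's note: the one-line patch above adds a driver version to what looks like a chain of strcmp() comparisons in nvml_bug.c. The original file is not reproduced in this thread, so the following is only a sketch of the same idea written as a lookup table. The names known_versions and driver_version_is_known are hypothetical, and "319.76" is the only version string taken from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table: only "319.76" comes from the diff in the post above.
 * Add the exact version string that your own driver reports. */
static const char *const known_versions[] = {
    "319.76",
};

/* Returns 1 if `version` exactly matches an entry in the table, 0 otherwise.
 * The caller must pass a valid, NUL-terminated string. */
int driver_version_is_known(const char *version)
{
    assert(version != NULL);

    size_t count = sizeof known_versions / sizeof known_versions[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(version, known_versions[i]) == 0) {
            return 1;
        }
    }
    return 0;
}
```

The match is exact, so "319.7" or "331.20" would be rejected until added to the table. The version string itself could be obtained from NVML with nvmlSystemGetDriverVersion(); whether the original program does it that way is an assumption.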
