Bug: NVML incorrectly detects certain GPUs as unsupported.

Hey all!

It seems NVML has a bug where it incorrectly returns NVML_ERROR_NOT_SUPPORTED on certain calls for certain GPUs.

For example, I have a GeForce GTX TITAN which supports reporting utilization, yet the nvmlDeviceGetUtilizationRates call still returns NVML_ERROR_NOT_SUPPORTED incorrectly. I wrote a C program demonstrating both this behavior and a way to work around the bug here: http://cfsworks.com/files/downloads/nvml_bug.c

The first time through, it returns NVML_ERROR_NOT_SUPPORTED for both calls, but the calls are clearly supported… Upon applying the workaround, the correct information comes back!

Even better, when I used GDB to call the workaround from within nvidia-smi, it correctly reported everything except ECC information (though this is fine as I’m fairly sure my TITAN is not equipped with ECC RAM).

Since the workaround isn’t terribly complex and doesn’t seem to have any side-effects, I’d imagine this to be a fairly simple bug to fix upstream.

@CFSworks:
Good work, I can’t wait to try it.
And since NVIDIA never admitted that this “bug” was a marketing choice,
it will be fun to see how they react to your code :)

-EDIT
Where is nvml.h? I tried the CUDA 5.0.35 package on Arch Linux, but it seems to be missing.
-EDIT
Found it: https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/NVML/tdk_3.304.5.tar.gz
But when I try to compile your code, I get a lot of undefined references, like here:
https://devtalk.nvidia.com/default/topic/532229/using-nvml-in-c-program/

Could you help please?

@kokoko3k:
Yeah, unless they state that this behavior is intentional, I’m going to operate under the assumption that it isn’t.
And even if this is intentional, it’s still a bug because it leads to broken functionality. If it was a marketing choice, it was a bug in the marketers rather than the code. ;)

Also, I compiled with:
gcc -onvml_bug nvml_bug.c -lnvidia-ml

See if that works and let me know. :D

It works fine:

Found 1 device(s):
        Device 0, "GeForce 9800 GT":
                ---- WITHOUT BUGFIX ----
                Utilization: Not Supported
                Power usage: Not Supported
                ---- WITH BUGFIX ----
                Utilization: 2% GPU, 2% MEM
                Power usage: Not Supported

Thank you :)

-EDIT

I was wondering why nvidia-smi was closed, but now…

I just created an even better workaround for this; it’s here:
https://github.com/CFSworks/nvml_fix

This makes nvidia-smi work again!

Any chance of getting a fix for driver version 304.22?

Bump! Does anybody have an idea on how to

  • make it work with the new 331 drivers;
  • fix or at least explain the reason behind the reported issues with various 319 and 325 drivers (https://github.com/CFSworks/nvml_fix/issues/3)?

PS: and just to reiterate what the OP said, it is absolutely ridiculous that metrics which are included in the graphical tool nvidia-settings are deliberately blocked in nvidia-smi and NVML solely for marketing purposes!

This is simply wrong! I would urge the developer/user community to voice their opinion on this - both here as well as in the form of an official bug report!

FYI: a colleague of mine has figured out the (probably most frequent) issue I and others have encountered. To sum it up briefly, gcc >=4.5 uses the --add-needed flag which causes incorrect linking and leads to the usual “Mismatch in versions between nvidia-smi and NVML…” error.

For more details see: https://github.com/CFSworks/nvml_fix/issues/3#issuecomment-30085297

I’m using gcc version 4.4.7
In any case, I just modified the original nvml_bug.c as follows:

diff nvml_bug.c~ nvml_bug.c
17a18
>       !strcmp(version, "319.76") ||

then compiled it with
gcc -o nvml_bug nvml_bug.c -l nvidia-ml

and can successfully run it like this:
./nvml_bug
Found 1 device(s):
        Device 0, "GeForce GTX 780 Ti":
                ---- WITHOUT BUGFIX ----
                Utilization: Not Supported
                Power usage: Not Supported
                ---- WITH BUGFIX ----
                Utilization: 59% GPU, 3% MEM
                Power usage: 173W
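
For anyone adding more driver versions later: the one-line diff above extends what looks like a chain of strcmp() comparisons against the driver version string. Below is a minimal sketch of the same idea as a table lookup. It is not the actual code from nvml_bug.c; the names known_versions and version_is_known are made up for illustration, and "319.76" is the only version confirmed in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical allowlist of driver versions the workaround is known to
 * handle. Only "319.76" is taken from this thread; add others only after
 * testing them yourself. */
static const char *const known_versions[] = {
    "319.76",
};

/* Returns 1 if `version` exactly matches an allowlist entry, 0 otherwise.
 * A NULL pointer is treated as "not known". */
int version_is_known(const char *version)
{
    size_t n = sizeof known_versions / sizeof known_versions[0];
    size_t i;

    if (version == NULL)
        return 0;

    for (i = 0; i < n; i++) {
        if (strcmp(version, known_versions[i]) == 0)
            return 1;
    }
    return 0;
}
```

With a table like this, supporting another driver means adding one string rather than editing the condition. The match is exact, so a version such as "319.760" would not be accepted by accident.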