It seems NVML has a bug where it incorrectly returns NVML_ERROR_NOT_SUPPORTED on certain calls for certain GPUs.
For example, I have a GeForce GTX TITAN which supports reporting utilization, yet the nvmlDeviceGetUtilizationRates call still returns NVML_ERROR_NOT_SUPPORTED incorrectly. I wrote a C program demonstrating both this behavior and a way to work around the bug here: [url]http://cfsworks.com/files/downloads/nvml_bug.c[/url]
The first time through, it returns NVML_ERROR_NOT_SUPPORTED for both calls, but the calls are clearly supported… Upon applying the workaround, the correct information comes back!
Even better, when I used GDB to call the workaround from within nvidia-smi, it correctly reported everything except ECC information (though this is fine as I’m fairly sure my TITAN is not equipped with ECC RAM).
Since the workaround isn’t terribly complex and doesn’t seem to have any side-effects, I’d imagine this to be a fairly simple bug to fix upstream.
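For anyone who does not want to open the linked file, here is a minimal sketch of how a caller might turn an NVML status code into a printable line such as "Not Supported". It is not the workaround itself and it does not call NVML. The helper names `format_utilization` and `format_power` are hypothetical, the two constants mirror the values of NVML_SUCCESS and NVML_ERROR_NOT_SUPPORTED in nvml.h, and the division assumes the reading came from nvmlDeviceGetPowerUsage, which reports milliwatts.

```c
#include <stdio.h>

/* Stand-ins for two nvmlReturn_t values from nvml.h, redefined here so the
 * sketch compiles without the NVML headers installed. */
enum {
    SKETCH_NVML_SUCCESS = 0,
    SKETCH_NVML_ERROR_NOT_SUPPORTED = 3
};

/* Hypothetical helper: write "<gpu>% GPU, <mem>% MEM" on success,
 * "Not Supported" for that specific error, or the raw code otherwise. */
void format_utilization(int status, unsigned int gpu, unsigned int mem,
                        char *buf, size_t len)
{
    if (status == SKETCH_NVML_SUCCESS)
        snprintf(buf, len, "%u%% GPU, %u%% MEM", gpu, mem);
    else if (status == SKETCH_NVML_ERROR_NOT_SUPPORTED)
        snprintf(buf, len, "Not Supported");
    else
        snprintf(buf, len, "Error %d", status);
}

/* Hypothetical helper: same idea for a power reading given in milliwatts,
 * printed as whole watts (fractional part truncated). */
void format_power(int status, unsigned int milliwatts, char *buf, size_t len)
{
    if (status == SKETCH_NVML_SUCCESS)
        snprintf(buf, len, "%uW", milliwatts / 1000);
    else if (status == SKETCH_NVML_ERROR_NOT_SUPPORTED)
        snprintf(buf, len, "Not Supported");
    else
        snprintf(buf, len, "Error %d", status);
}
```

In a real program, the status would be the return value of nvmlDeviceGetUtilizationRates or nvmlDeviceGetPowerUsage. The point is only that "Not Supported" is one specific return code, so it can be reported separately from other failures.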
@CFSworks:
Good work, I can’t wait to try it.
And since NVIDIA never admitted that this “bug” was a marketing choice,
it will be fun to see how they react to your code :)
@kokoko3k:
Yeah, unless they state that this behavior is intentional, I’m going to operate under the assumption that it isn’t.
And even if this is intentional, it’s still a bug because it leads to broken functionality. If it was a marketing choice, it was a bug in the marketers rather than the code. ;)
Also, I compiled with:
gcc -onvml_bug nvml_bug.c -lnvidia-ml
Found 1 device(s):
Device 0, "GeForce 9800 GT":
---- WITHOUT BUGFIX ----
Utilization: Not Supported
Power usage: Not Supported
---- WITH BUGFIX ----
Utilization: 2% GPU, 2% MEM
Power usage: Not Supported
Thank you :)
-EDIT
I was wondering why nvidia-smi was closed, but now…
PS: and just to reiterate what the OP said, it is absolutely ridiculous that metrics which are included in the graphical tool nvidia-settings are deliberately blocked in nvidia-smi and NVML solely for marketing purposes!
This is simply wrong! I would urge the developer/user community to voice their opinion on this - both here as well as in the form of an official bug report!
FYI: a colleague of mine has figured out the (probably most frequent) issue that I and others have encountered. To sum it up briefly, gcc >= 4.5 uses the --add-needed flag, which causes incorrect linking and leads to the usual “Mismatch in versions between nvidia-smi and NVML…” error.
I’m using gcc version 4.4.7
In any case, I just modified the original nvml_bug.c as follows:
diff nvml_bug.c~ nvml_bug.c
17a18
> !strcmp(version, "319.76") ||
then compiled it with
gcc -o nvml_bug nvml_bug.c -l nvidia-ml
and can successfully run it like this:
./nvml_bug
Found 1 device(s):
Device 0, "GeForce GTX 780 Ti":
---- WITHOUT BUGFIX ----
Utilization: Not Supported
Power usage: Not Supported
---- WITH BUGFIX ----
Utilization: 59% GPU, 3% MEM
Power usage: 173W
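The one-line diff above suggests that nvml_bug.c only applies the workaround when the driver version string matches one of a chain of strcmp comparisons. If that reading is right, the same check could be written as a small lookup table, which makes adding a version a one-entry change. This is only a sketch under that assumption: `known_versions` and `version_is_known` are hypothetical names, and "319.76" is the only entry taken from this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical allowlist of driver version strings. Only "319.76" comes
 * from the diff in this thread; the original file's other entries are
 * not reproduced here. */
static const char *const known_versions[] = {
    "319.76",
};

/* Return 1 if `version` exactly matches an allowlist entry, 0 otherwise.
 * A NULL pointer is treated as "not known". */
int version_is_known(const char *version)
{
    size_t count = sizeof known_versions / sizeof known_versions[0];
    size_t i;

    if (version == NULL)
        return 0;
    for (i = 0; i < count; i++) {
        if (strcmp(version, known_versions[i]) == 0)
            return 1;
    }
    return 0;
}
```

The version string itself would presumably come from nvmlSystemGetDriverVersion. The match is exact, so "319.7" or "319.76.1" would not be accepted.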