NVML 12.535.43.02 breaks backwards compatibility

The new usedGpuCcProtectedMemory field in nvmlProcessInfo_t breaks compatibility with older driver versions: code compiled against the old struct byte size still calls nvmlDeviceGetGraphicsRunningProcesses_v3 (or similar), but the library now expects the larger struct, and because the struct isn't actually versioned like the documentation claims, the function can't tell which layout the caller is using. nvidia-smi only works because it has been recompiled.
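
Roughly what this looks like in practice (a sketch; the field layouts are paraphrased from the 11.x and 12.2 headers, the struct names here are just for illustration, so check your local nvml.h):

/* Layout compiled into pre-535.43 applications (CUDA 11.x nvml.h). */
typedef struct {
    unsigned int       pid;
    unsigned long long usedGpuMemory;
    unsigned int       gpuInstanceId;
    unsigned int       computeInstanceId;
} oldProcessInfo_t;

/* Layout the 535.43.02 library writes through the same _v3 entry point. */
typedef struct {
    unsigned int       pid;
    unsigned long long usedGpuMemory;
    unsigned int       gpuInstanceId;
    unsigned int       computeInstanceId;
    unsigned long long usedGpuCcProtectedMemory;  /* new in CUDA 12.2 */
} newProcessInfo_t;

/* An old binary passes an array of the small struct, but the new library
 * strides through it using the large size: every entry after the first is
 * corrupted and writes can run past the end of the caller's buffer. */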

This really should be fixed in the next driver release. Please respond.

I also wanted to repeat what @BlueGoliath said: there are bug fixes in 535.43 that I very much want, but I don’t want to have to sacrifice NVML to get them. Please respond.

This needs fixed. Bump.

No one could take 15 minutes to fix this in 535.54.03? I guess I’ll just bump this thread every few days until someone fixes it.

It seems you have built an executable/application based on an old version (driver/NVML) and are trying to use it with a new driver.
In such a case, to disable the auto-upgrade of the NVML APIs, please define NVML_NO_UNVERSIONED_FUNC_DEFS in the application and explicitly call the versioned NVML API (e.g. nvmlDeviceGetGraphicsRunningProcesses_v2).
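
A minimal sketch of what that looks like (assuming the CUDA 12.x header, where the _v2 entry point is declared against nvmlProcessInfo_v2_t; check your nvml.h for the exact parameter type, and error handling is omitted for brevity):

/* Define before including nvml.h so the header does not rewrite
 * unversioned calls to the newest _vN entry points. */
#define NVML_NO_UNVERSIONED_FUNC_DEFS
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlProcessInfo_v2_t procs[64];   /* _v2 layout, no CC memory field */
    unsigned int count = 64;

    nvmlInit_v2();
    nvmlDeviceGetHandleByIndex_v2(0, &dev);

    /* Explicitly versioned call, pinned to the _v2 ABI. */
    if (nvmlDeviceGetGraphicsRunningProcesses_v2(dev, &count, procs) == NVML_SUCCESS)
        printf("%u graphics processes\n", count);

    nvmlShutdown();
    return 0;
}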

Appreciate the reply, but this doesn’t help any apps that want to support the newest NVML/driver versions while keeping fallbacks in case symbols are missing. NVML now effectively has two incompatible versions of nvmlDeviceGetGraphicsRunningProcesses_v3, and neither works correctly with the other.
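
To be concrete, the kind of fallback I mean is something like this (a rough dlsym sketch; the typedef and helper are mine, not NVML's, and it only tells you whether a symbol exists, not which struct layout the library expects):

#include <dlfcn.h>
#include <nvml.h>
#include <stddef.h>

typedef nvmlReturn_t (*getProcsFn)(nvmlDevice_t, unsigned int *, void *);

/* Resolve the newest running-processes entry point the installed
 * libnvidia-ml actually exports, falling back to older versions. */
static getProcsFn resolve_get_graphics_procs(void **outLib)
{
    void *lib = dlopen("libnvidia-ml.so.1", RTLD_NOW);
    if (!lib)
        return NULL;
    *outLib = lib;

    const char *names[] = {
        "nvmlDeviceGetGraphicsRunningProcesses_v3",
        "nvmlDeviceGetGraphicsRunningProcesses_v2",
        "nvmlDeviceGetGraphicsRunningProcesses",
    };
    for (int i = 0; i < 3; i++) {
        void *sym = dlsym(lib, names[i]);
        if (sym)
            return (getProcsFn)sym;
    }
    return NULL;
}

/* The catch: when _v3 resolves, the symbol alone can't tell you whether the
 * library expects the old or the new nvmlProcessInfo_t layout, so the caller
 * still has to guess the struct size. */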

I get that NVML isn’t a priority and that the focus is all AI right now, but it would be great if more care were given to it. This isn’t the first time something broken was introduced to NVML. Nvidia added overclocking functions to NVML and they didn’t, and still don’t, work correctly, despite NVAPI’s overclocking working just fine. There are multi-threading problems that already exist and are occasionally introduced into NVML, especially on Windows, that cause havoc in applications using it. Functions in NVML are documented to support a GPU architecture but either flat-out don’t, or are only partially supported in odd ways, like application clocks.

I’d love to send feedback to the appropriate people about all of these issues, but I was told years ago that Nvidia doesn’t provide support, and responses here are unpredictable. I still have various threads in the NVAPI forum waiting for responses from Nvidia. The Linux forum is full of seemingly endless unanswered support threads about issues I don’t even know how people manage to run into.

Again, I appreciate the response, but I wish more care were given. I realize that the driver/software team is probably stretched thin across the massive number of things Nvidia does, but if we could at the very least avoid breaking things or introducing broken features or code, that’d be great.

I also ran into a compatibility issue while working with the new nvmlProcessInfo_t structure in the NVIDIA Management Library (NVML). The introduction of the usedGpuCcProtectedMemory field in the latest version breaks compatibility with older driver versions. The problem arises when code uses the old struct byte size with functions like nvmlDeviceGetGraphicsRunningProcesses_v3 or similar, as the struct isn’t truly versioned as documented.

Hello,

We’re sorry for the ABI-breakage, and are working on a fix for this issue.

feechec441 is a bot.

Thanks anyway.

It looks like a new driver has been released (535.86.10). But unfortunately, with that version, we’re getting a lot of:

symbol lookup error: undefined symbol: nvmlDeviceGetGraphicsRunningProcesses_v3

In 535.54.03, we could see the 3 versioned functions:

# strings /usr/lib64/libnvidia-ml.so.535.54.03 | grep nvmlDeviceGetGraphicsRunningProcesses | sort -u
nvmlDeviceGetGraphicsRunningProcesses
nvmlDeviceGetGraphicsRunningProcesses_v2
nvmlDeviceGetGraphicsRunningProcesses_v3

But in 535.86.10, there seem to be just 2 left:

# strings /usr/lib64/libnvidia-ml.so.535.86.10 | grep nvmlDeviceGetGraphicsRunningProcesses | sort -u
nvmlDeviceGetGraphicsRunningProcesses
nvmlDeviceGetGraphicsRunningProcesses_v2

Has nvmlDeviceGetGraphicsRunningProcesses_v3 been removed from NVML now?

CUDA 12.2 is still the recommended version for driver 535.86.10 as far as I can tell, and nvmlDeviceGetGraphicsRunningProcesses_v3() is still there in CUDA 12.2: NVML API Reference Guide :: GPU Deployment and Management Documentation

So I’m quite confused, how is that expected to work?

There doesn’t seem to be an easy way out of this so Nvidia just removed the function. If you aren’t interested in the confidential computing fields then you can downgrade to v2. Maybe they plan on adding a v3_1 or v4 in order to start over.

I had assumed that there was maybe an internal change that justified the v3 function despite no API change but I guess not?

nvmlDeviceGetGraphicsRunningProcesses_v3 has been in NVML since at least CUDA 11.7, meaning that with the automatic upgrade of the NVML APIs, everything that was compiled with a CUDA version between 11.7 and 12.2 and didn’t explicitly call _v2 actually uses _v3, which has now been removed from the library.
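
(For anyone unfamiliar with the mechanism: the automatic upgrade is done with plain #defines in nvml.h, roughly like this paraphrased excerpt, so an unversioned call site silently compiles against the newest _vN symbol.)

/* Paraphrased from nvml.h */
#ifndef NVML_NO_UNVERSIONED_FUNC_DEFS
    #define nvmlInit                              nvmlInit_v2
    #define nvmlDeviceGetGraphicsRunningProcesses nvmlDeviceGetGraphicsRunningProcesses_v3
    /* ...and so on for the other versioned APIs... */
#endif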

And the function has only been removed from the library; it’s still declared in nvml.h in the latest CUDA 12.2 Update 1 that was released alongside 535.86.10.

So that’s not fixing anything, that’s making things even worse. :\

Hi,
Yes, “nvmlDeviceGetGraphicsRunningProcesses_v3” had a bug and was removed from the latest 535 driver release. We are working on updating the API. In the meantime, please use the older “nvmlDeviceGetGraphicsRunningProcesses_v2” function instead.

Well, you can’t remove existing functions from libraries without any warning or even a mention in the release notes, especially versioned functions that you decided were the default for the last 6 CUDA/NVML versions.

You can’t ship a library and its headers with mismatching functions either, that’s not how software development and release management work.

What NVIDIA just shipped here is a driver and NVML libs that break all the software that was compiled with up-to-date CUDA for the last year or so.

What’s your plan to fix this? Asking every user to modify code (theirs and 3rd parties’) to explicitly use an earlier version of a function they probably never even intentionally used (thanks to the automatic API upgrade) is clearly not very well thought out.

Any existing application that used the old function was probably crashing anyway unless it was recompiled, I imagine. A dynamically linked app driven from Python, Java, or any other non-native language would be, anyway. Mine did.

I don’t get the point of the automatic API upgrading. If you go from nvmlDeviceGetMemoryInfo to nvmlDeviceGetMemoryInfo_v2, you have to set the version field, so it isn’t “free” anyway, and if you were interested in the new reserved field you’d have to write code for that as well (something like the sketch below).
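
A sketch of the _v2 memory query, assuming dev is an already-initialized nvmlDevice_t handle:

#include <stdio.h>
#include <nvml.h>

/* Querying memory through the _v2 API: the caller has to set the
 * version field in the struct before the call. */
static void print_memory(nvmlDevice_t dev)
{
    nvmlMemory_v2_t mem;
    mem.version = nvmlMemory_v2;   /* required by the versioned-struct API */
    if (nvmlDeviceGetMemoryInfo_v2(dev, &mem) == NVML_SUCCESS)
        printf("used %llu / %llu bytes (%llu reserved)\n",
               (unsigned long long)mem.used,
               (unsigned long long)mem.total,
               (unsigned long long)mem.reserved);
}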

My 2 cents is that auto upgrading of APIs should be removed. Nothing good comes from it.

This was a huge mess and mistake. I found this thread by Googling.

So now this has been fixed, but it’s unfortunate that no official answer from NVIDIA was posted here saying that everything is fixed. I can see that 535.104.05 fixed all the compatibility issues.

TL;DR: Avoid drivers 535.43–535.98, which broke backwards compatibility.

More details can be found: