NVML 12.535.43.02 breaks backwards compatibility

The new usedGpuCcProtectedMemory field in nvmlProcessInfo_t breaks compatibility with older driver versions: code compiled against the old struct byte size still calls nvmlDeviceGetGraphicsRunningProcesses_v3 (or similar), but the library now expects the larger struct, and because the struct isn't actually versioned like the documentation claims, the function can't tell which layout the caller is using. nvidia-smi only works because it has been recompiled.
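
Roughly what this looks like in practice (a sketch; the field layouts are paraphrased from the 11.x and 12.2 headers, the struct names here are just for illustration, so check your local nvml.h):

/* Layout compiled into pre-535.43 applications (CUDA 11.x nvml.h). */
typedef struct {
    unsigned int       pid;
    unsigned long long usedGpuMemory;
    unsigned int       gpuInstanceId;
    unsigned int       computeInstanceId;
} oldProcessInfo_t;

/* Layout the 535.43.02 library writes through the same _v3 entry point. */
typedef struct {
    unsigned int       pid;
    unsigned long long usedGpuMemory;
    unsigned int       gpuInstanceId;
    unsigned int       computeInstanceId;
    unsigned long long usedGpuCcProtectedMemory;  /* new in CUDA 12.2 */
} newProcessInfo_t;

/* An old binary passes an array of the small struct, but the new library
 * strides through it using the large size: every entry after the first is
 * corrupted and writes can run past the end of the caller's buffer. */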

This really should be fixed in the next driver release. Please respond.

I also wanted to repeat what @BlueGoliath said: there are bug fixes in 535.43 that I very much want, but I don’t want to have to sacrifice NVML to get them. Please respond.

This needs fixed. Bump.

No one could take 15 minutes to fix this in 535.54.03? I guess I’ll just bump this thread every few days until someone fixes it.

It seems you have built an executable/application based on an old version (driver/NVML) and are trying to use it with a new driver.
In such a case, to disable the auto-upgrade of the NVML APIs, please define NVML_NO_UNVERSIONED_FUNC_DEFS in the application and explicitly call the versioned NVML API (e.g. nvmlDeviceGetGraphicsRunningProcesses_v2).
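
A minimal sketch of what that looks like (assuming the CUDA 12.x header, where the _v2 entry point is declared against nvmlProcessInfo_v2_t; check your nvml.h for the exact parameter type, and error handling is omitted for brevity):

/* Define before including nvml.h so the header does not rewrite
 * unversioned calls to the newest _vN entry points. */
#define NVML_NO_UNVERSIONED_FUNC_DEFS
#include <nvml.h>
#include <stdio.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlProcessInfo_v2_t procs[64];   /* _v2 layout, no CC memory field */
    unsigned int count = 64;

    nvmlInit_v2();
    nvmlDeviceGetHandleByIndex_v2(0, &dev);

    /* Explicitly versioned call, pinned to the _v2 ABI. */
    if (nvmlDeviceGetGraphicsRunningProcesses_v2(dev, &count, procs) == NVML_SUCCESS)
        printf("%u graphics processes\n", count);

    nvmlShutdown();
    return 0;
}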

Appreciate the reply, but this doesn’t help any apps that want to support the newest NVML/driver versions while keeping fallbacks in case symbols are missing. NVML now effectively has two incompatible versions of nvmlDeviceGetGraphicsRunningProcesses_v3, and neither works correctly with the other.
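
To be concrete, the kind of fallback I mean is something like this (a rough dlsym sketch; the typedef and helper are mine, not NVML's, and it only tells you whether a symbol exists, not which struct layout the library expects):

#include <dlfcn.h>
#include <nvml.h>
#include <stddef.h>

typedef nvmlReturn_t (*getProcsFn)(nvmlDevice_t, unsigned int *, void *);

/* Resolve the newest running-processes entry point the installed
 * libnvidia-ml actually exports, falling back to older versions. */
static getProcsFn resolve_get_graphics_procs(void **outLib)
{
    void *lib = dlopen("libnvidia-ml.so.1", RTLD_NOW);
    if (!lib)
        return NULL;
    *outLib = lib;

    const char *names[] = {
        "nvmlDeviceGetGraphicsRunningProcesses_v3",
        "nvmlDeviceGetGraphicsRunningProcesses_v2",
        "nvmlDeviceGetGraphicsRunningProcesses",
    };
    for (int i = 0; i < 3; i++) {
        void *sym = dlsym(lib, names[i]);
        if (sym)
            return (getProcsFn)sym;
    }
    return NULL;
}

/* The catch: when _v3 resolves, the symbol alone can't tell you whether the
 * library expects the old or the new nvmlProcessInfo_t layout, so the caller
 * still has to guess the struct size. */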

I get that NVML isn’t a priority and that the focus is all AI right now, but it would be great if more care were given to it. This isn’t the first time something broken was introduced to NVML. Nvidia added overclocking functions to NVML and they didn’t, and still don’t, work correctly, despite NVAPI’s overclocking working just fine. There are multi-threading problems that already exist and are occasionally introduced into NVML, especially on Windows, that cause havoc in applications using it. Functions in NVML are documented to support a GPU architecture but either flat-out don’t, or are only partially supported in odd ways, like application clocks.

I’d love to send feedback to the appropriate people about all of these issues, but I was told years ago that Nvidia doesn’t provide support, and responses here are unpredictable. I still have various threads in the NVAPI forum waiting for responses from Nvidia. The Linux forum is full of seemingly endless unanswered support threads about issues I don’t even know how people manage to run into.

Again, I appreciate the response, but I wish more care were given. I realize that the driver/software team is probably stretched thin across the massive number of things Nvidia does, but if we could at the very least avoid breaking things or introducing broken features or code, that’d be great.

I also ran into a compatibility issue while working with the new nvmlProcessInfo_t structure in the NVIDIA Management Library (NVML). The introduction of the usedGpuCcProtectedMemory field in the latest version breaks compatibility with older driver versions. The problem arises when code uses the old struct byte size with functions like nvmlDeviceGetGraphicsRunningProcesses_v3 or similar, as the struct isn’t truly versioned as documented.

Hello,

We’re sorry for the ABI-breakage, and are working on a fix for this issue.

feechec441 is a bot.

Thanks anyway.

It looks like a new driver has been released (535.86.10). But unfortunately, with that version, we’re getting a lot of:

symbol lookup error: undefined symbol: nvmlDeviceGetGraphicsRunningProcesses_v3

In 535.54.03, we could see the 3 versioned functions:

# strings /usr/lib64/libnvidia-ml.so.535.54.03 | grep nvmlDeviceGetGraphicsRunningProcesses | sort -u
nvmlDeviceGetGraphicsRunningProcesses
nvmlDeviceGetGraphicsRunningProcesses_v2
nvmlDeviceGetGraphicsRunningProcesses_v3

But in 535.86.10, there seem to be just 2 left:

# strings /usr/lib64/libnvidia-ml.so.535.86.10 | grep nvmlDeviceGetGraphicsRunningProcesses | sort -u
nvmlDeviceGetGraphicsRunningProcesses
nvmlDeviceGetGraphicsRunningProcesses_v2

Has nvmlDeviceGetGraphicsRunningProcesses_v3 been removed from NVML now?

CUDA 12.2 is still the recommended version for driver 535.86.10 as far as I can tell, and nvmlDeviceGetGraphicsRunningProcesses_v3() is still there in CUDA 12.2: NVML API Reference Guide :: GPU Deployment and Management Documentation

So I’m quite confused, how is that expected to work?

There doesn’t seem to be an easy way out of this so Nvidia just removed the function. If you aren’t interested in the confidential computing fields then you can downgrade to v2. Maybe they plan on adding a v3_1 or v4 in order to start over.

I had assumed that there was maybe an internal change that justified the v3 function despite no API change but I guess not?

nvmlDeviceGetGraphicsRunningProcesses_v3 has been in NVML since at least CUDA 11.7, meaning that with the automatic upgrade of the NVML APIs, everything that was compiled with a CUDA version between 11.7 and 12.2 and didn’t explicitly call _v2 actually uses _v3, which has now been removed from the library.
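
(For anyone unfamiliar with the mechanism: the automatic upgrade is done with plain #defines in nvml.h, roughly like this paraphrased excerpt, so an unversioned call site silently compiles against the newest _vN symbol.)

/* Paraphrased from nvml.h */
#ifndef NVML_NO_UNVERSIONED_FUNC_DEFS
    #define nvmlInit                              nvmlInit_v2
    #define nvmlDeviceGetGraphicsRunningProcesses nvmlDeviceGetGraphicsRunningProcesses_v3
    /* ...and so on for the other versioned APIs... */
#endif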

And the function has only been removed from the library; it’s still declared in nvml.h in the latest CUDA 12.2 Update 1 that was released alongside 535.86.10.

So that’s not fixing anything, that’s making things even worse. :\

Hi,
Yes, “nvmlDeviceGetGraphicsRunningProcesses_v3” had a bug and was removed from the latest 535 driver release. We are working on updating the API. In the meantime, please use the older “nvmlDeviceGetGraphicsRunningProcesses_v2” function instead.

Well, you can’t remove existing functions from libraries without any warning or even a mention in the release notes, especially versioned functions that you decided were the default for the last 6 CUDA/NVML versions.

You can’t ship a library and its headers with mismatching functions either, that’s not how software development and release management work.

What NVIDIA just shipped here is a driver and NVML libs that break all the software that was compiled with up-to-date CUDA for the last year or so.

What’s your plan to fix this? Asking every user to modify code (theirs and 3rd parties’) to explicitly use an earlier version of a function they probably never even intentionally used (thanks to the automatic API upgrade) is clearly not very well thought out.

Any existing application that used the old function was probably crashing anyway unless it was recompiled, I imagine. A dynamically linked app driven from Python, Java, or any other non-native language would be, anyway. Mine did.

I don’t get the point of the automatic API upgrading. If you go from nvmlDeviceGetMemoryInfo to nvmlDeviceGetMemoryInfo_v2, you have to set the version field, so it isn’t “free” anyway, and if you were interested in the new reserved field you’d have to write code for that as well (something like the sketch below).
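
A sketch of the _v2 memory query, assuming dev is an already-initialized nvmlDevice_t handle:

#include <stdio.h>
#include <nvml.h>

/* Querying memory through the _v2 API: the caller has to set the
 * version field in the struct before the call. */
static void print_memory(nvmlDevice_t dev)
{
    nvmlMemory_v2_t mem;
    mem.version = nvmlMemory_v2;   /* required by the versioned-struct API */
    if (nvmlDeviceGetMemoryInfo_v2(dev, &mem) == NVML_SUCCESS)
        printf("used %llu / %llu bytes (%llu reserved)\n",
               (unsigned long long)mem.used,
               (unsigned long long)mem.total,
               (unsigned long long)mem.reserved);
}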

My 2 cents is that auto upgrading of APIs should be removed. Nothing good comes from it.

This was a huge mess and mistake. I found this thread by Googling.

So now this has been fixed, but it’s unfortunate that no official answer from NVIDIA was posted here saying that everything is fixed. I can see that 535.104.05 fixed all the compatibility issues.

TL;DR: Avoid drivers 535.43–535.98, which broke backwards compatibility.

More details can be found: