[PyPI/nvidia-ml-py] Issue Reports for `nvidia-ml-py`

I tried to report issues for nvidia-ml-py to the e-mail address I found on PyPI, nvml-bindings@nvidia.com, but my e-mail was rejected by the mail server:

The recipient’s domain has rejected your message because there is no recipient’s e-mail address in the domain’s directory. It may be that the address is misspelled or does not exist.

I am reposting my issues in this forum and will wait for an update.


The original post:

Dear maintainers of nvidia-ml-py:

Firstly, thanks so much for creating and maintaining such a useful package. It allows users to write monitoring tools for NVIDIA GPUs in Python. I found some issues and/or bugs while creating my top-like monitor. I didn't find a place (like GitHub issues) to report them, so I decided to write an e-mail to the address I found on PyPI.

Issues and questions:

  1. Is there any plan to open-source the code or host it on GitHub (or a similar platform), like NVIDIA/go-nvml? This would greatly facilitate submitting issues and improving the bindings.

  2. Backward compatibility between driver and binding versions.

    Since CUDA 11, the definition of nvmlProcessInfo_t has two new fields, gpuInstanceId and computeInstanceId:

    /**
     * Information about running compute processes on the GPU
     */
    typedef struct nvmlProcessInfo_st
    {
        unsigned int        pid;                //!< Process ID
        unsigned long long  usedGpuMemory;      //!< Amount of used GPU memory in bytes.
                                                //! Under WDDM, \ref NVML_VALUE_NOT_AVAILABLE is always reported
                                                //! because Windows KMD manages all the memory and not the NVIDIA driver
        unsigned int        gpuInstanceId;      //!< If MIG is enabled, stores a valid GPU instance ID. gpuInstanceId is set to
                                                //  0xFFFFFFFF otherwise.
        unsigned int        computeInstanceId;  //!< If MIG is enabled, stores a valid compute instance ID. computeInstanceId is set to
                                                //  0xFFFFFFFF otherwise.
    } nvmlProcessInfo_t;
    

    The Python bindings will get wrong results or raise a FunctionNotFound error with pre-CUDA-11 drivers (still widely used on Ubuntu 16.04 LTS):

    | v1 | NVIDIA Driver 430.64 | NVIDIA Driver 470.57.02 |
    | --- | --- | --- |
    | nvidia-ml-py==11.450.51 | works, but without CI ID / GI ID | works, but without CI ID / GI ID |
    | nvidia-ml-py>=11.450.129 | no exceptions in Python, but wrong results (subscript out of range in the C library) | no exceptions in Python, but wrong results (subscript out of range in the C library) |

    | v2 | NVIDIA Driver 430.64 | NVIDIA Driver 470.57.02 |
    | --- | --- | --- |
    | nvidia-ml-py==11.450.51 | function not found | no exceptions in Python, but wrong results (subscript out of range in the C library) |
    | nvidia-ml-py>=11.450.129 | function not found | works, with correct CI ID / GI ID |
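
    One plausible explanation for the "wrong results" cases is a mismatch between the struct layout the bindings allocate and the layout the driver actually fills in. Below is a minimal sketch (my illustration, not the actual nvidia-ml-py code) of the two ctypes layouts implied by the header above:

        from ctypes import Structure, c_uint, c_ulonglong, sizeof

        class ProcessInfo_v1(Structure):
            # Layout known to pre-CUDA-11 drivers: PID and used GPU memory only.
            _fields_ = [
                ("pid", c_uint),
                ("usedGpuMemory", c_ulonglong),
            ]

        class ProcessInfo_v2(Structure):
            # Layout since CUDA 11: two extra MIG-related fields.
            _fields_ = [
                ("pid", c_uint),
                ("usedGpuMemory", c_ulonglong),
                ("gpuInstanceId", c_uint),
                ("computeInstanceId", c_uint),
            ]

        # The two layouts have different sizes, so an array of one cannot be
        # filled correctly by a driver that only knows the other: entries land
        # at the wrong offsets, with no Python exception raised.
        print(sizeof(ProcessInfo_v1), sizeof(ProcessInfo_v2))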

    Similar issues on NVIDIA/go-nvml: issue NVIDIA/go-nvml#21 and pull request NVIDIA/go-nvml#25.

    NVIDIA/go-nvml claims it is designed to be backward compatible:

    These bindings are not a reimplementation of NVML in Go, but rather a set of wrappers around the C API provided by libnvidia-ml.so. This library is part of the standard NVIDIA driver distribution, and should be available on any Linux system that has the NVIDIA driver installed. The API is designed to be backwards compatible, so the latest bindings should work with any version of libnvidia-ml.so installed on your system.

    NVIDIA/go-nvml looks up the versioned APIs (suffixed with _v1, _v2, etc.) on initialization and sets the unversioned bindings to the version compatible with the driver on the system. Will there be similar handling for the Python bindings in nvidia-ml-py?
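
    For illustration, here is a minimal sketch of that lookup strategy in plain ctypes. It is not part of nvidia-ml-py; the library soname and the probing logic are my assumptions, and it simply checks which symbols the installed driver exports:

        import ctypes

        lib = ctypes.CDLL("libnvidia-ml.so.1")  # soname installed by the NVIDIA driver

        def resolve(name):
            """Return the newest available variant of an NVML function."""
            for suffix in ("_v2", ""):  # prefer the newest versioned symbol
                try:
                    return getattr(lib, name + suffix)
                except AttributeError:
                    continue  # this driver does not export that variant
            raise AttributeError(f"{name} not found in libnvidia-ml")

        # e.g. pick whichever process-listing call the installed driver provides
        get_procs = resolve("nvmlDeviceGetComputeRunningProcesses")

    The bindings would of course also have to pair the resolved symbol with the matching struct layout.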

  3. Bug: the bindings should return Python types rather than ctypes objects.

    The function nvmlDeviceIsMigDeviceHandle has been in nvidia-ml-py since version 11.450.51. It returns a ctypes c_uint rather than a Python int or bool.

    def nvmlDeviceIsMigDeviceHandle(device):
        c_isMigDevice = c_uint()
        fn = _nvmlGetFunctionPointer("nvmlDeviceIsMigDeviceHandle")
        ret = fn(device, byref(c_isMigDevice))
        _nvmlCheckReturn(ret)
        return c_isMigDevice
    

    The return statement should be changed to return c_isMigDevice.value, as the other bindings do:

    def nvmlDeviceIsMigDeviceHandle(device):
        c_isMigDevice = c_uint()
        fn = _nvmlGetFunctionPointer("nvmlDeviceIsMigDeviceHandle")
        ret = fn(device, byref(c_isMigDevice))
        _nvmlCheckReturn(ret)
    -   return c_isMigDevice
    +   return c_isMigDevice.value
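
    For illustration (my example, not library code), the raw ctypes return value does not even compare equal to a Python int, so callers can silently get wrong answers:

        from ctypes import c_uint

        raw = c_uint(1)        # what nvmlDeviceIsMigDeviceHandle currently returns
        print(raw == 1)        # False -- ctypes objects compare by identity, not value
        print(raw.value == 1)  # True  -- callers must remember to unwrap .value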
    

Looking forward to a reply!

Sincerely

Xuehai Pan

Another breaking change.

nvidia-ml-py 11.515.0 (Jan 12, 2022) now introduces v3 functions (nvmlDeviceGetComputeRunningProcesses_v3, etc.). Looking at the diff, however, the actual implementation has not changed, other than that the function pointer now refers to nvmlDeviceGetGraphicsRunningProcesses_v3 rather than the _v2 variant. What change has actually been made?

This breaks older driver versions, because older NVIDIA drivers (probably those released before Jan 2022) do not have the low-level function `nvmlDeviceGetGraphicsRunningProcesses_v3`: for instance, 470.86 in my case.

Can we make the nvidia-ml-py bindings backward compatible with old driver versions? If the v3 functions are not available with the installed NVIDIA driver, we could fall back to the v2 or v1 versions (subject to the corresponding data-structure changes as well).
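
For reference, here is a rough sketch of such a fallback, written against pynvml's private helper _nvmlGetFunctionPointer (shown in the snippets above). This is not the library's API, just an illustration of the proposal; it assumes _nvmlGetFunctionPointer raises NVMLError when a symbol is missing:

    from pynvml import (NVMLError, NVML_ERROR_FUNCTION_NOT_FOUND, nvmlInit,
                        _nvmlGetFunctionPointer)

    nvmlInit()  # libnvidia-ml must be loaded before symbols can be resolved

    def resolve_running_processes_fn():
        # Prefer the newest symbol the installed driver actually exports.
        for name in ("nvmlDeviceGetComputeRunningProcesses_v3",
                     "nvmlDeviceGetComputeRunningProcesses_v2",
                     "nvmlDeviceGetComputeRunningProcesses"):
            try:
                return _nvmlGetFunctionPointer(name)
            except NVMLError:
                continue  # not exported by this driver; try an older variant
        raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)

    # Note: the resolved function must be paired with the matching
    # nvmlProcessInfo_t layout (v1 without the MIG IDs, v2/v3 with them).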