[PyPI/nvidia-ml-py] Issue Reports for `nvidia-ml-py`

I tried to report issues for nvidia-ml-py to the e-mail address I found on PyPI, nvml-bindings@nvidia.com, but my e-mail was rejected by the mail server:

The recipient’s domain has rejected your message because there is no recipient’s e-mail address in the domain’s directory. It may be that the address is misspelled or does not exist.

I am reposting my issues in this forum and will wait for an update.


The original post:

Dear maintainers of nvidia-ml-py:

Firstly, thanks so much for creating and maintaining such a useful package. It allows users to write monitoring tools for NVIDIA GPUs in Python. I found some issues and/or bugs while creating my top-like monitor. I didn't find a place (like GitHub issues) to report them, so I decided to write an e-mail to the address I found on PyPI.

Issues and questions:

  1. Is there any plan to open-source the code or host it on GitHub (or a similar platform), like NVIDIA/go-nvml? This would greatly facilitate submitting issues and improving the bindings.

  2. Backward compatibility between driver and binding versions.

    Since CUDA 11, the definition of nvmlProcessInfo_t has two new fields, gpuInstanceId and computeInstanceId:

    /**
     * Information about running compute processes on the GPU
     */
    typedef struct nvmlProcessInfo_st
    {
        unsigned int        pid;                //!< Process ID
        unsigned long long  usedGpuMemory;      //!< Amount of used GPU memory in bytes.
                                                //! Under WDDM, \ref NVML_VALUE_NOT_AVAILABLE is always reported
                                                //! because Windows KMD manages all the memory and not the NVIDIA driver
        unsigned int        gpuInstanceId;      //!< If MIG is enabled, stores a valid GPU instance ID. gpuInstanceId is set to
                                                //  0xFFFFFFFF otherwise.
        unsigned int        computeInstanceId;  //!< If MIG is enabled, stores a valid compute instance ID. computeInstanceId is set to
                                                //  0xFFFFFFFF otherwise.
    } nvmlProcessInfo_t;
    

    The Python bindings will get wrong results or raise a FunctionNotFound error with pre-CUDA-11 drivers (still widely used on Ubuntu 16.04 LTS):

    | v1 | NVIDIA Driver 430.64 | NVIDIA Driver 470.57.02 |
    | --- | --- | --- |
    | nvidia-ml-py==11.450.51 | works, but without CI ID / GI ID | works, but without CI ID / GI ID |
    | nvidia-ml-py>=11.450.129 | no exceptions in Python, but wrong results (subscript out of range in the C library) | no exceptions in Python, but wrong results (subscript out of range in the C library) |

    | v2 | NVIDIA Driver 430.64 | NVIDIA Driver 470.57.02 |
    | --- | --- | --- |
    | nvidia-ml-py==11.450.51 | function not found | no exceptions in Python, but wrong results (subscript out of range in the C library) |
    | nvidia-ml-py>=11.450.129 | function not found | works, with correct CI ID / GI ID |
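
    One plausible explanation for the "wrong results" cases is a mismatch between the struct layout the bindings allocate and the layout the driver actually fills in. Below is a minimal sketch (my illustration, not the actual nvidia-ml-py code) of the two ctypes layouts implied by the header above:

        from ctypes import Structure, c_uint, c_ulonglong, sizeof

        class ProcessInfo_v1(Structure):
            # Layout known to pre-CUDA-11 drivers: PID and used GPU memory only.
            _fields_ = [
                ("pid", c_uint),
                ("usedGpuMemory", c_ulonglong),
            ]

        class ProcessInfo_v2(Structure):
            # Layout since CUDA 11: two extra MIG-related fields.
            _fields_ = [
                ("pid", c_uint),
                ("usedGpuMemory", c_ulonglong),
                ("gpuInstanceId", c_uint),
                ("computeInstanceId", c_uint),
            ]

        # The two layouts have different sizes, so an array of one cannot be
        # filled correctly by a driver that only knows the other: entries land
        # at the wrong offsets, with no Python exception raised.
        print(sizeof(ProcessInfo_v1), sizeof(ProcessInfo_v2))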

    Similar issues on NVIDIA/go-nvml: issue NVIDIA/go-nvml#21 and pull request NVIDIA/go-nvml#25.

    NVIDIA/go-nvml claims it is designed to be backward compatible:

    These bindings are not a reimplementation of NVML in Go, but rather a set of wrappers around the C API provided by libnvidia-ml.so. This library is part of the standard NVIDIA driver distribution, and should be available on any Linux system that has the NVIDIA driver installed. The API is designed to be backwards compatible, so the latest bindings should work with any version of libnvidia-ml.so installed on your system.

    NVIDIA/go-nvml looks up the versioned APIs (suffixed with _v1, _v2, etc.) on initialization and sets the unversioned bindings to the version compatible with the driver on the system. Will there be similar handling for the Python bindings in nvidia-ml-py?
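
    For illustration, here is a minimal sketch of that lookup strategy in plain ctypes. It is not part of nvidia-ml-py; the library soname and the probing logic are my assumptions, and it simply checks which symbols the installed driver exports:

        import ctypes

        lib = ctypes.CDLL("libnvidia-ml.so.1")  # soname installed by the NVIDIA driver

        def resolve(name):
            """Return the newest available variant of an NVML function."""
            for suffix in ("_v2", ""):  # prefer the newest versioned symbol
                try:
                    return getattr(lib, name + suffix)
                except AttributeError:
                    continue  # this driver does not export that variant
            raise AttributeError(f"{name} not found in libnvidia-ml")

        # e.g. pick whichever process-listing call the installed driver provides
        get_procs = resolve("nvmlDeviceGetComputeRunningProcesses")

    The bindings would of course also have to pair the resolved symbol with the matching struct layout.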

  3. Bug: the bindings should return Python types rather than ctypes objects.

    The function nvmlDeviceIsMigDeviceHandle has been in nvidia-ml-py since version 11.450.51. It returns a ctypes c_uint rather than a Python int or bool.

    def nvmlDeviceIsMigDeviceHandle(device):
        c_isMigDevice = c_uint()
        fn = _nvmlGetFunctionPointer("nvmlDeviceIsMigDeviceHandle")
        ret = fn(device, byref(c_isMigDevice))
        _nvmlCheckReturn(ret)
        return c_isMigDevice
    

    The return statement should be changed to return c_isMigDevice.value, as the other bindings do:

    def nvmlDeviceIsMigDeviceHandle(device):
        c_isMigDevice = c_uint()
        fn = _nvmlGetFunctionPointer("nvmlDeviceIsMigDeviceHandle")
        ret = fn(device, byref(c_isMigDevice))
        _nvmlCheckReturn(ret)
    -   return c_isMigDevice
    +   return c_isMigDevice.value
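
    For illustration (my example, not library code), the raw ctypes return value does not even compare equal to a Python int, so callers can silently get wrong answers:

        from ctypes import c_uint

        raw = c_uint(1)        # what nvmlDeviceIsMigDeviceHandle currently returns
        print(raw == 1)        # False -- ctypes objects compare by identity, not value
        print(raw.value == 1)  # True  -- callers must remember to unwrap .value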
    

Looking forward to a reply!

Sincerely

Xuehai Pan

Another breaking change.

nvidia-ml-py 11.515.0 (Jan 12, 2022) now introduces v3 functions (nvmlDeviceGetComputeRunningProcesses_v3, etc.). Looking at the diff, however, the actual implementation has not changed, other than that the function pointer now refers to nvmlDeviceGetGraphicsRunningProcesses_v3 rather than the _v2 variant. What change has actually been made?

This breaks older driver versions, because older NVIDIA drivers (probably those released before Jan 2022) do not have the low-level function `nvmlDeviceGetGraphicsRunningProcesses_v3`: for instance, 470.86 in my case.

Can we make the nvidia-ml-py bindings backward compatible with old driver versions? If the v3 functions are not available with the installed NVIDIA driver, we could fall back to the v2 or v1 versions (subject to the corresponding data-structure changes as well).
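
For reference, here is a rough sketch of such a fallback, written against pynvml's private helper _nvmlGetFunctionPointer (shown in the snippets above). This is not the library's API, just an illustration of the proposal; it assumes _nvmlGetFunctionPointer raises NVMLError when a symbol is missing:

    from pynvml import (NVMLError, NVML_ERROR_FUNCTION_NOT_FOUND, nvmlInit,
                        _nvmlGetFunctionPointer)

    nvmlInit()  # libnvidia-ml must be loaded before symbols can be resolved

    def resolve_running_processes_fn():
        # Prefer the newest symbol the installed driver actually exports.
        for name in ("nvmlDeviceGetComputeRunningProcesses_v3",
                     "nvmlDeviceGetComputeRunningProcesses_v2",
                     "nvmlDeviceGetComputeRunningProcesses"):
            try:
                return _nvmlGetFunctionPointer(name)
            except NVMLError:
                continue  # not exported by this driver; try an older variant
        raise NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)

    # Note: the resolved function must be paired with the matching
    # nvmlProcessInfo_t layout (v1 without the MIG IDs, v2/v3 with them).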