Issue Activating HMM Feature on NVIDIA RTX A4500 with CUDA Toolkit 12.4 on Debian Bookworm

karsten.ehrlich · March 6, 2024, 11:17am

Hello NVIDIA Community,

I’m having trouble activating the Heterogeneous Memory Management (HMM) feature on my system. Here are the details of my setup:

Operating System: Debian Bookworm, running kernel 6.6.13+bpo-amd64 from bookworm-backports.
GPU and Driver Details (from nvidia-smi -q):
- Driver Version: 550.54.14
- CUDA Version: 12.4
- GPU Model: NVIDIA RTX A4500 (Ampere architecture)

Although I believe my system meets the requirements for HMM (a recent Linux kernel, an up-to-date NVIDIA driver, and the latest CUDA Toolkit), HMM does not appear to be enabled: nvidia-smi reports “Addressing Mode : None”. Here’s a snippet from my NVSMI log:

==============NVSMI LOG==============

Timestamp                                 : Wed Mar  6 11:55:48 2024
Driver Version                            : 550.54.14
CUDA Version                              : 12.4

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA RTX A4500
    Product Brand                         : NVIDIA RTX
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None

Given this information, is there any reason my GPU might not support HMM, or are there additional system checks I could perform? Also, is there a way to force-enable this feature, or a specific configuration I might be missing?
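One quick system check is the kernel version itself. As I read NVIDIA’s HMM blog post, the minimum kernels are 6.1.24+, 6.2.11+ or 6.3+ (please double-check against the post). The following is a minimal sketch assuming those thresholds; the function name kernel_supports_hmm is hypothetical, and the kernel version is only one of several prerequisites (driver flavor and GPU architecture matter too).

```shell
# Succeeds if a kernel release string (as printed by `uname -r`) meets the
# assumed HMM minimums: 6.1.24+, 6.2.11+ or 6.3+.
kernel_supports_hmm() {
    ver=${1%%[-+]*}                           # drop suffixes like "+bpo-amd64" or "-39-generic"
    major=$(echo "$ver" | cut -d. -f1)
    minor=$(echo "$ver" | cut -s -d. -f2)     # -s: empty if there is no dot
    patch=$(echo "$ver" | cut -s -d. -f3)
    minor=${minor:-0}
    patch=${patch:-0}

    if [ "$major" -gt 6 ]; then return 0; fi
    if [ "$major" -lt 6 ]; then return 1; fi
    case "$minor" in
        0) return 1 ;;
        1) [ "$patch" -ge 24 ] ;;
        2) [ "$patch" -ge 11 ] ;;
        *) return 0 ;;
    esac
}

# Check the running kernel.
if kernel_supports_hmm "$(uname -r)"; then
    echo "kernel $(uname -r) meets the assumed HMM minimum"
else
    echo "kernel $(uname -r) is older than the assumed HMM minimum"
fi
```

For the setup above, 6.6.13+bpo-amd64 passes this check, so the kernel version alone would not explain “Addressing Mode : None”.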

Any guidance or insights from the community would be greatly appreciated. Thank you for your time and help!

paleonix · July 26, 2024, 10:23pm

I’m also on Debian Bookworm, with kernel 6.9.7+bpo-amd64 from bookworm-backports and driver 555.42.06 (using nvidia-kernel-open-dkms) from the NVIDIA repository, running an RTX 4080, and I also get “Addressing Mode : None”. Could this be related to features of the motherboard, chipset, CPU, etc.?
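Since NVIDIA’s HMM blog post ties HMM to the open kernel modules, it may be worth confirming that the module actually loaded is the open one, not just that the open DKMS package is installed. A minimal sketch, assuming the open module’s banner in /proc/driver/nvidia/version contains the string “Open Kernel Module”; the function name is hypothetical:

```shell
# Succeeds if the loaded NVIDIA kernel module identifies itself as the open
# kernel module. Assumption: the open module's banner in
# /proc/driver/nvidia/version contains "Open Kernel Module".
is_open_module() {
    # Default to the running driver's banner; another file can be passed for testing.
    version_file=${1:-/proc/driver/nvidia/version}
    grep -q "Open Kernel Module" "$version_file" 2>/dev/null
}

# Only check when an NVIDIA module is actually loaded.
if [ -r /proc/driver/nvidia/version ]; then
    if is_open_module; then
        echo "open kernel module is loaded"
    else
        echo "proprietary kernel module is loaded"
    fi
fi
```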

Robert_Crovella · July 27, 2024, 5:15pm

GeForce and Quadro GPUs may need a particular enablement step; see the note at the end.

Here’s what I did, starting with an Asus Z87-PRO motherboard and GeForce GTX 1660 Super GPU (Turing).

- Fresh install of Ubuntu 24.04 desktop, selecting the default (basic) installation and opting to install the proprietary drivers (NVIDIA GPU, codecs). Afterward, an R535 driver was installed, and the kernel was 6.8.0-39-generic.
- Enable the “unsupported” GeForce/Quadro option: echo "options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" | sudo tee /etc/modprobe.d/nvidia-gsp.conf
- Install CUDA Toolkit 12.5.1.
- Install the open kernel module driver from the above link. I ended up with 555.42.06.
- Reboot.
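The modprobe.d step can also be written so that repeating it never duplicates the option. A minimal sketch with a hypothetical helper ensure_line; unlike the tee command in the steps, it appends to the file instead of overwriting it, and it must run as root for files under /etc:

```shell
# Append LINE to FILE only if that exact line is not already present.
ensure_line() {
    file=$1
    line=$2
    # -x: whole-line match, -F: fixed string, -- protects lines starting with "-".
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example (run as root):
# ensure_line /etc/modprobe.d/nvidia-gsp.conf \
#     'options nvidia NVreg_OpenRmEnableUnsupportedGpus=1'
```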
After that, nvidia-smi -a indicated HMM addressing mode:

$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp                                 : Sat Jul 27 12:13:32 2024
Driver Version                            : 555.42.06
CUDA Version                              : 12.5

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce GTX 1660 SUPER
    Product Brand                         : GeForce
    Product Architecture                  : Turing
    Display Mode                          : Enabled
    Display Active                        : Enabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : HMM
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-cddf7e4c-74b1-30a2-5445-ecf715827977
    Minor Number                          : 0
    VBIOS Version                         : 90.16.48.00.45
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : N/A
    GPU Part Number                       : 21C4-300-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G001.0000.02.04
        OEM Object                        : 1.1
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    GSP Firmware Version                  : 555.42.06
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x0
        Device Id                         : 0x21C410DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x40201458
        GPU Link Info
            PCIe Generation
                Max                       : 3
                Current                   : 1
                Device Current            : 1
                Device Max                : 3
                Host Max                  : 3
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 0 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : 0 %
    Performance State                     : P8
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 6144 MiB
        Reserved                          : 394 MiB
        Used                              : 109 MiB
        Free                              : 5643 MiB
    BAR1 Memory Usage
        Total                             : 256 MiB
        Used                              : 4 MiB
        Free                              : 252 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 2 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 40 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 96 C
        GPU Slowdown Temp                 : 93 C
        GPU Max Operating Temp            : 91 C
        GPU Target Temperature            : 83 C
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : 12.68 W
        Current Power Limit               : 140.00 W
        Requested Power Limit             : 140.00 W
        Default Power Limit               : 140.00 W
        Min Power Limit                   : 70.00 W
        Max Power Limit                   : 176.00 W
    GPU Memory Power Readings
        Power Draw                        : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 300 MHz
        SM                                : 300 MHz
        Memory                            : 405 MHz
        Video                             : 540 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
    Processes
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 1923
            Type                          : G
            Name                          : /usr/lib/xorg/Xorg
            Used GPU Memory               : 37 MiB
        GPU instance ID                   : N/A
        Compute instance ID               : N/A
        Process ID                        : 2156
            Type                          : G
            Name                          : /usr/bin/gnome-shell
            Used GPU Memory               : 60 MiB
    Capabilities
        EGM                               : disabled

$

After all this, I deleted the /etc/modprobe.d/nvidia-gsp.conf file that had been created and rebooted, and nvidia-smi still showed the addressing mode as HMM. So I’m not sure what the important difference between my setup and the others might be.
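When comparing setups, it may help to reduce each full nvidia-smi -q report to the few fields discussed in this thread and diff them. A minimal sketch; the helper name key_fields and the chosen field list are my own assumptions:

```shell
# Reduce `nvidia-smi -q` output (read on stdin) to a few comparable fields,
# normalized to "Name: value" with the column padding removed.
key_fields() {
    grep -E '^ *(Driver Version|CUDA Version|Product Architecture|Addressing Mode) *:' |
        sed -E 's/^ +//; s/ *: */: /'
}

# Only query the driver if nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -q | key_fields
fi
```

For example, run `nvidia-smi -q | key_fields > mine.txt` on each machine and compare the files with `diff`.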

This blog may be of interest.

paleonix · July 27, 2024, 11:27pm

I had been looking in the wrong places for something like echo "options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" | sudo tee /etc/modprobe.d/nvidia-gsp.conf, in particular that blog post and the CUDA 12.2 release notes, so thank you for pointing out where to find this detail.

Sadly, it still doesn’t work for me. I switched back and forth between the legacy and open kernel modules as described here, and even completely reinstalled the NVIDIA-related packages and rebooted many times, without any luck. Ideally I would test this on a completely fresh Debian installation, but that will have to wait. If anyone else has results, positive or negative, on Debian Bookworm, I would love to hear about them.
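Another distro-specific difference could be the kernel configuration. My assumption (not confirmed in this thread) is that the open nvidia-uvm module only enables HMM when the kernel was built with CONFIG_HMM_MIRROR and CONFIG_DEVICE_PRIVATE. A minimal sketch that checks both options, assuming the running kernel’s config is installed at /boot/config-$(uname -r) as on Debian and Ubuntu; the helper name is hypothetical:

```shell
# Succeeds if OPTION is built in (=y) according to the kernel config FILE.
kconfig_enabled() {
    grep -q "^$2=y" "$1" 2>/dev/null
}

# Check the assumed HMM-related options for the running kernel.
config=/boot/config-$(uname -r)
if [ -r "$config" ]; then
    for opt in CONFIG_HMM_MIRROR CONFIG_DEVICE_PRIVATE; do
        if kconfig_enabled "$config" "$opt"; then
            echo "$opt=y"
        else
            echo "$opt is not set"
        fi
    done
fi
```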

paleonix · August 2, 2024, 6:29pm

@Robert_Crovella The CUDA 12.6 release notes say that the modprobe.d file has not been needed since driver version 545, so no wonder it didn’t help me and didn’t affect HMM working for you.
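Going by that release note, whether the option matters can be decided from the driver’s major version alone. A minimal sketch with a hypothetical helper; the 545 threshold is taken from the 12.6 release notes as quoted in this post:

```shell
# Succeeds if the driver is older than 545, i.e. a release where, according
# to the CUDA 12.6 release notes quoted in this post, the
# NVreg_OpenRmEnableUnsupportedGpus option may still be needed.
needs_unsupported_gpu_option() {
    # Compare only the major version, e.g. "560" from "560.28.03".
    [ "${1%%.*}" -lt 545 ]
}

# Only query the driver if nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    if needs_unsupported_gpu_option "$drv"; then
        echo "driver $drv: NVreg_OpenRmEnableUnsupportedGpus=1 may be needed"
    else
        echo "driver $drv: the option should not be needed"
    fi
fi
```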

I still don’t get HMM addressing with driver version 560.28.03 installed via the new meta package nvidia-open-560.

Robert_Crovella · December 11, 2024, 11:56pm

paleonix · December 12, 2024, 2:44am

Thanks for the heads-up!
