Heterogeneous Memory Support (HMM) in NVIDIA UVM driver and Linux 4.14

lubos.krcal · December 14, 2017, 6:33am

Hi, I have been trying to make HMM work wit the newest possible setup. There have been a few topics about HMM support on the forum before, but none of them had any follow-ups. I have investigated the Nvidia drivers, and there already seems to be support for HMM, however I didn’t manage to make it work.

My setup: CUDA 9.1.85, NVIDIA driver 387.26, Linux 4.14.5 (Ubuntu build), GTX 1050 Ti and GTX 1080 Ti.

First I checked the HMM support in the kernel:

grep HMM /boot/config-4.14.5-041405-generic
CONFIG_ARCH_HAS_HMM=y
CONFIG_HMM=y
CONFIG_HMM_MIRROR=y

Then I modified the nvidia-uvm module sources and recompiled the module to enable HMM support, reinserted it with the required parameters, and made sure that the hmm correctly enabled using uvm_hmm_is_enabled:

// You need all of these things, in order to actually run HMM:
//
//     1) An HMM kernel, with CONFIG_HMM set.
//
//     2) UVM Kernel module parameter set: uvm_hmm=1
//
//     3) ATS must not be enabled
//
bool uvm_hmm_is_enabled(void)
{
    return (uvm_hmm != 0) && (uvm8_ats_mode == 0);
}

Got a small example to test the HMM:

$ cat ./hmm-test.cu
#include <stdio.h>

__global__ void
compute_this(int *pDataFromCpu)
{
    atomicAdd(pDataFromCpu, 1);
}

int main(void)
{
    int *pData = (int*)malloc(sizeof(int));
    *pData = 1;

    compute_this<<<1,1024>>>(pData);
    if (cudaDeviceSynchronize() != cudaSuccess)
        printf("Error \n");

    printf("Results: %d\n", *pData);
    free(pData);
    return 0;
}
$ /usr/local/cuda-9.1/bin/nvcc -gencode arch=compute_61,code=sm_61 hmm-test.cu -o hmm-test
$ ./hmm-test 
Error 
Results: 1

The corresponding dmesg is:

[Thu Dec 14 19:04:40 2017] nvidia-uvm: Unloaded the UVM driver in 8 mode
[Thu Dec 14 19:04:43 2017] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 238
[Thu Dec 14 19:04:49 2017] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000038, engmask 00000101, intr 10000000

When I try to ftrace the nvidia-uvm module, it successfully calls into linux kernel’s hmm functions, namel:

hmm_device_new()
hmm_device_put()
hmm_device_release(
hmm_mirror_register()
hmm_mirror_unregister()

and gets callbacks through .sync_cpu_device_pagetables = &mirror_sync_cpu_device_pagetables; however this callback is not implemented in the nvidia-uvm driver. This is a small sample of address ranges from the sync callback:

mirror_sync_cpu_device_pagetables: type: 0, 0x00007f21bef3c000 - 0x00007f21c0000000 
mirror_sync_cpu_device_pagetables: type: 0, 0x00007f21dee00000 - 0x00007f21df000000 
mirror_sync_cpu_device_pagetables: type: 0, 0x00007f21d0000000 - 0x00007f21def3b000 
mirror_sync_cpu_device_pagetables: type: 0, 0x00007f21d5f39000 - 0x00007f21d8000000 
mirror_sync_cpu_device_pagetables: type: 0, 0x00007f21dc000000 - 0x00007f21ddf39000 
mirror_sync_cpu_device_pagetables: type: 0, 0x00007f21d8000000 - 0x00007f21d8200000
...

Is there a way to make the current nvidia-uvm module and CUDA runtime work with Linux HMM, given that the code is already present in the driver, and can easily be enabled?

oscarbg · December 16, 2017, 10:19pm

Hi,
thanks for your evaluation of current Nvidia HMM support…
some things to say:

seems 390.xx series drivers is coming this year (see phoronix NVIDIA Confirms Linux Driver Performance Regression, To Be Fixed In 390 Series - Phoronix)
“Succeeding the NVIDIA 387 driver series will be the NVIDIA 390 Linux driver series, which should get into public beta in December.”
hope you can test with 390 driver to see if mirror_sync_cpu_device_pagetables callback is implemented and HMM finally works…
tried to replicate your tests with same CUDA 9.1 compiler, latest driver (387.34 patched for 4.15rc2 support) and kernel 4.15rc2 which is HMM enabled…
but I use Maxwell GTX970 GPU so changed to sm_52
“/usr/local/cuda-9.1/bin/nvcc -gencode arch=compute_52,code=sm_52 hmm-test.cu -o hmm-test”
I get different error in dmesg and anyway seems HMM is only supported on Pascal and later GPUs right?
just to be sure I’m testing my Maxwell GPU correctly your step
" I modified the nvidia-uvm module sources and recompiled the module to enable HMM support," is not very clear:
all I done is calling “make module” with NV_BUILD_SUPPORTS_HMM=1 argument:
“make module SYSSRC=/usr/src/linux-headers-4.15.0-041500rc2-generic NV_BUILD_SUPPORTS_HMM=1”
delete prveious module
rmmod nvidia_uvm
and then load module with
modprobe nvidia_uvm uvm_hmm=1

notice I don’t need to modify source code

thanks…

Kevin127 · May 30, 2018, 7:56pm

Hi,

Did anyone successfully run HMM in NVIDIA driver and Linux with the latest NVIDIA driver on Pascal or other GPUs? If you did, which GPU/ driver version/ Linux version did you use?

Thanks,
Kevin

zhaojp · August 9, 2018, 9:53am

Up! any guild or example from Nvidia to try or test HMM?
appreciated!

hwinkler · August 25, 2018, 1:55pm

Yes please… guidance from nVidia please.

tyos · June 14, 2019, 8:30am

[s]Maybe it’s too late, but I did. I needed the below command before I run my code.

$ sudo modprobe -r nvidia_uvm
$ sudo modprobe nvidia_uvm uvm_hmm=1 uvm8_ats_mode=0

I needed to set uvm8_ats_mode 0 in addition to uvm_hmm.

I used V100 GPU on Skylake, nvidia driver: 418.67 (CUDA 10.1), Ubuntu 18.04LTS, Linux 4.15[/s]

Hmm, I couldn’t run HMM with the above configs at this time… maybe there was mistake around my test code. Sorry for my confusing post…

ianz26vw · December 11, 2019, 5:37am

2 years later and this still does not seem to work despite all the hype from NVIDIA about it.

Using Kernel 5.0.0 with HMM support, Pascal GPU, and driver 440.33.01

modprobe nvidia_uvm uvm_hmm=1 uvm_ats8_mode=0

The test program still fails with an Xid Error. :/

val.zapod.vz · August 2, 2022, 4:00pm

github.com/NVIDIA/open-gpu-kernel-modules

HMM support in UVM

opened 05:04AM - 29 Jul 22 UTC

VivekPanyam

Feature Pending NV-Triaged

### NVIDIA Open GPU Kernel Modules Version 515.57 Release ### Does this happen… with the proprietary driver (of the same version) as well? Yes ### Operating System and Version Ubuntu 20.04.1 LTS ### Kernel Release 5.13.0-1029-aws ### Hardware: GPU NVIDIA T4 ### Describe the bug Hello! HMM support has been mentioned in several NVIDIA docs and presentations since 2017 (including the announcement of the open-source kernel modules), but it seems to be disabled here (and doesn't work when using the proprietary driver). https://github.com/NVIDIA/open-gpu-kernel-modules/blob/d8f3bcff924776518f1e63286537c3cf365289ac/kernel-open/nvidia-uvm/uvm_hmm.h#L35-L50 I assume the referenced bug/task is internal. Is there any information you can share on what additional work needs to happen to enable UVM-HMM (or potentially a timeline?). See references and a repro below. ### To Reproduce I'm testing HMM using the following code (from one of the presentations linked below): ```cpp #include <stdio.h> #define LEN sizeof(int) __global__ void compute_this(int *pDataFromCpu) { atomicAdd(pDataFromCpu, 1); } int main(void) { int *pData = (int*)malloc(LEN); *pData = 0; // Run on GPU: compute_this<<<512,1000>>>(pData); cudaDeviceSynchronize(); printf("Results: %d\n", *pData); free(pData); return 0; } ``` It currently just prints `Results: 0` and the following message is in the output of `dmesg` ``` [ 3896.223804] NVRM: Xid (PCI:0000:00:1e): 31, pid=42009, name=a.out, Ch 00000007, intr 00000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_T1_0 faulted @ 0x5558_a1468000. Fault is of type FAULT_PDE ACCESS_TYPE_VIRT_ATOMIC ``` ### Bug Incidence Always ### nvidia-bug-report.log.gz N/A ### More Info ## References: * 2017 GTC talk - [Using HMM to Blur the Lines between CPU and GPU Programming](https://on-demand.gputechconf.com/gtc/2017/presentation/s7764_john-hubbardgpus-using-hmm-blur-the-lines-between-cpu-and-gpu.pdf) * 2019 GTC talk - [Memory Management on Modern GPU Architectures](https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9727-memory-management-on-modern-gpu-architectures.pdf) * [The announcement post for this repo](https://developer.nvidia.com/blog/nvidia-releases-open-source-gpu-kernel-modules/) * A 2017 Red Hat summit presentation - [GPUs: HMM: Heterogeneous Memory Management](https://www.redhat.com/files/summit/session-assets/2017/S104078-hubbard.pdf)

steve174 · March 19, 2023, 6:09pm

I did an eval recently with HMM+UVM. The current implementation is so poor, that my debian kernel crashes before or during any type of performance benchmark.

I think it has taken 5+ years for this technology to make it to market. I have to wonder if it will be worth the hassle.

Currently, this technology is extremely immature and unreliable.

Thank you,
Steve Dake

Topic		Replies	Views
Issue Activating HMM Feature on NVIDIA RTX A4500 with CUDA Toolkit 12.4 on Debian Bookworm CUDA Setup and Installation	6	444	December 12, 2024
Simplifying GPU Application Development with Heterogeneous Memory Management Technical Blog	0	383	August 22, 2023
Nvidia process not running Linux	25	2814	December 31, 2021
Enabling Heterogeneous Memory Management CUDA Setup and Installation	0	1488	April 30, 2018
/dev/nvidia-uvm IO error on Ubuntu 22.04, 520 to 535 driver versions Linux cuda , opencl , linux-driver	2	2843	August 27, 2023
Non-existent shared VRAM on NVIDIA Linux drivers Linux	54	9900	December 12, 2024
HMM support in Linux driver CUDA Setup and Installation	3	1866	June 30, 2020
Broken GPU state query failure in AMD + H100 Confidential Computing	10	919	February 15, 2024
NVIDIA Linux x86_64 340.98 Driver not building with Kernel 4.9.0-rc2+ [RESOLVED] Linux	8	6910	October 14, 2021
Driver allocating memory over pci slot size Linux kernel	12	2579	February 16, 2021

Heterogeneous Memory Support (HMM) in NVIDIA UVM driver and Linux 4.14

Related topics