Linux 4.4.168+ and NVIDIA-Linux-x86_64-410.93 and get_user_pages()

Hello.

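conftest.sh in NVIDIA-Linux-x86_64-410.93 is not setting the correct #defines for Linux 4.4.168 and later, so the build calls get_user_pages() with the wrong number of arguments.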

Note that linux 4.4.167 and earlier 4.4.y kernels do compile.

Below is the first set of errors from nvidia-installer.log:

Thanks !

– kjh

In file included from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-linux.h:21:0,
                    from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.c:15:
   /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.c: In function 'os_lock_user_pages':
   /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.c:120:48: warning: passing argument 6 of 'get_user_pages' makes pointer from integer without a cast [-Wint-conversion]
                                page_count, write, force, user_pages, NULL);
                                                   ^
   /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-mm.h:44:70: note: in definition of macro 'NV_GET_USER_PAGES'
            get_user_pages(current, current->mm, start, nr_pages, write, force, pages, vmas)
                                                                         ^
   In file included from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-pgprot.h:17:0,
                    from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-linux.h:20,
                    from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.c:15:
   /usr/src/linux-4.4.170.kjh/include/linux/mm.h:1200:6: note: expected 'struct page **' but argument is of type 'NvBool {aka unsigned char}'
    long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
         ^
   In file included from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-linux.h:21:0,
                    from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.c:15:
   /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.c:120:55: warning: passing argument 7 of 'get_user_pages' from incompatible pointer type [-Wincompatible-pointer-types]
                                page_count, write, force, user_pages, NULL);
                                                          ^
   /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-mm.h:44:77: note: in definition of macro 'NV_GET_USER_PAGES'
            get_user_pages(current, current->mm, start, nr_pages, write, force, pages, vmas)
                                                                                ^
   In file included from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-pgprot.h:17:0,
                    from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-linux.h:20,
                    from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.c:15:
   /usr/src/linux-4.4.170.kjh/include/linux/mm.h:1200:6: note: expected 'struct vm_area_struct **' but argument is of type 'struct page **'
    long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
         ^
   In file included from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-linux.h:21:0,
                    from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.c:15:
   /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-mm.h:44:9: error: too many arguments to function 'get_user_pages'
            get_user_pages(current, current->mm, start, nr_pages, write, force, pages, vmas)
            ^
   /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.c:119:11: note: in expansion of macro 'NV_GET_USER_PAGES'
        ret = NV_GET_USER_PAGES((unsigned long)address,
              ^
   In file included from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-pgprot.h:17:0,
                    from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/common/inc/nv-linux.h:20,
                    from /tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.c:15:
   /usr/src/linux-4.4.170.kjh/include/linux/mm.h:1200:6: note: declared here
    long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
         ^
   /usr/src/linux-4.4.170.kjh/scripts/Makefile.build:277: recipe for target '/tmp/selfgz3548/NVIDIA-Linux-x86_64-410.93/kernel/nvidia/os-mlock.o' failed

nvidia-installer-failed-4.4.170.kjh-NVidia-410.93.log (85.2 KB)
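
For anyone who wants to check a machine against the version boundary described above (4.4.167 and earlier build, 4.4.168 and later fail), here is a minimal shell sketch. The helper name is_affected_4_4 is made up for illustration and is not part of any NVIDIA tool. It assumes an upstream-style 4.4.y release string such as 4.4.170.kjh; distro kernels that number their releases differently (for example Ubuntu's 4.4.0-143) will not be recognized by it.

```shell
#!/bin/sh
# Sketch only: is_affected_4_4 is a hypothetical helper, not an NVIDIA tool.
# Returns 0 if the release string is an upstream-style 4.4.y kernel with
# y >= 168 (the range reported as failing in this thread), 1 otherwise.
is_affected_4_4() {
    release=$1
    case "$release" in
        4.4.*) ;;              # only the 4.4.y series is considered
        *) return 1 ;;
    esac
    patch=${release#4.4.}      # "4.4.170.kjh" -> "170.kjh"
    patch=${patch%%[!0-9]*}    # keep the leading digits only -> "170"
    [ -n "$patch" ] && [ "$patch" -ge 168 ]
}
```

It could be used as `is_affected_4_4 "$(uname -r)" && echo "this kernel is in the reported failing range"`.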

Hi kjh,

Thanks for bringing this issue to our attention. We’re tracking it internally as bug #2488457.

Thanks for the update, lmiddlebrook !

Is there a way to follow the progress of bug #2488457 ?

And if you need a tester, please let me know and I will be happy to oblige :)

Thanks again.

– kjh

I’ll post on this thread when a driver containing the fix is made available. The fix will be included in our next beta driver release.

Thank you lmiddlebrook !

– kjh

The freshly posted beta driver 418.30 contains the fix for this issue.

lmiddlebrook –

Thank you very much for the heads up !

I am happy to report that 418.30 Beta compiled and runs with the 64-bit 4.4.172 Kernel.

Now I’ve got a similar issue with VMware Workstation, so I am back on 4.4.167 with NVIDIA 418.30 Beta.

getting there …

Thanks again !

– kjh

EDIT: Changed "4.4.172" to "64-bit 4.4.172 Kernel"

Hi lmiddlebrook,

Is there any possibility that the above fix will also be ported to the legacy releases?
A patch that can be applied on the fly would also be fine (and it would make life easier for distro maintainers)…

The same error seems to occur for the 340.107 driver on kernel version 4.4.172 machines.

And the same error occurs with the 390.77 driver on the 4.4.172 kernel.

The problem also occurs on Ubuntu 16.04 with kernel 4.4.0-143 and nvidia-384.
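
To confirm that a failed build on another driver branch is really this same error, one option is to search the installer log for the "too many arguments to function 'get_user_pages'" message from the first post. A rough sketch follows. The helper name has_gup_mismatch is hypothetical, and the default path /var/log/nvidia-installer.log is an assumption about where the installer wrote its log on your system; pass the real path if it differs.

```shell
#!/bin/sh
# Sketch only: has_gup_mismatch is a hypothetical helper name.
# Returns 0 if the given log contains the get_user_pages() argument-count
# error quoted in this thread, non-zero otherwise.
# The default log path is an assumption; pass another path as $1 if needed.
has_gup_mismatch() {
    log=${1:-/var/log/nvidia-installer.log}
    # ".*" tolerates either plain or typographic quotes around the name
    grep -q "too many arguments to function.*get_user_pages" "$log"
}
```

For example: `has_gup_mismatch && echo "same get_user_pages() build failure"`.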

This bug broke my Ubuntu 16.04 system during an unattended upgrade last week. Kernel 4.4.0-143 was delivered through the unattended-upgrades channel, and the GPU driver (v410) silently went offline. I had to roll back to kernel 4.4.0-142 and disable unattended upgrades to get the application working again.

Fixed using https://www.nvidia.fr/Download/driverResults.aspx/142654/fr
Thanks, Olivier, for the direct link; this driver was not listed among the drivers for the (Quadro) NVS 315.

Linux x64 (AMD64/EM64T) Display Driver

Version: 390.116
Release date: 2019.2.22
Operating system: Linux 64-bit
Language: French
Size: 78.47 MB

Context: upgrade from kernel 4.4.0-142 to kernel 4.4.0-143 on Ubuntu 16.04

How is a new v390 driver a fix for a broken version v410 driver? v390 is not even available to me from the nvidia package repository:

$ apt-cache search nvidia| grep "binary driver"
nvidia-304 - NVIDIA legacy binary driver - version 304.135
nvidia-340 - NVIDIA binary driver - version 340.107
nvidia-384 - NVIDIA binary driver - version 384.130
nvidia-410 - NVIDIA binary driver - version 410.48

If you need to use the legacy releases 340 or 390 on Ubuntu 16.04, please upgrade your HWE stack to get a newer kernel:
https://wiki.ubuntu.com/Kernel/LTSEnablementStack
then add the graphics ppa and install the 340 or 390 driver from there:
https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa
If you can use driver versions >=410, you can just add the ppa and install driver version 418.

I’m not sure how to parse your text, “if you can use driver versions >= 410 …”. I can use driver 410 if it successfully compiles. So, assuming that was what you meant, I went ahead and installed v418 as per the instructions.

I use “numba -s” to check if the GPU driver is working. The error message from “numba -s” was this with the broken 410 driver:

[100] Call to cuInit results in CUDA_ERROR_NO_DEVICE

With the 418 driver, the message is this:

CUDA driver library cannot be found

I think this is a worse error message but I’m not sure. The apt-get install command to install the nvidia-418 package completed successfully.

I got it to work. I also had to install the libcuda1-418 package.

So, putting it all together, the solution is:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install libcuda1-418 nvidia-418
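
The two “numba -s” messages quoted earlier pointed at different problems in this thread: CUDA_ERROR_NO_DEVICE appeared while the 410 kernel module was broken, and “CUDA driver library cannot be found” went away once libcuda1-418 was installed. A small sketch for telling them apart is below. The helper name classify_cuda_error and the labels it prints are made up for illustration, and the mapping only reflects what was observed here, not a general diagnosis.

```shell
#!/bin/sh
# Sketch only: classify_cuda_error is a hypothetical helper name.
# Reads diagnostic text (e.g. the output of "numba -s") on stdin and prints
# one label based on the two messages seen in this thread.
classify_cuda_error() {
    input=$(cat)
    case "$input" in
        *CUDA_ERROR_NO_DEVICE*)
            # seen here while the kernel module failed to build
            echo "kernel-module" ;;
        *"CUDA driver library cannot be found"*)
            # seen here until libcuda1-418 was installed
            echo "libcuda-missing" ;;
        *)
            echo "unknown" ;;
    esac
}
```

Usage would be along the lines of `numba -s 2>&1 | classify_cuda_error`.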