Cuda broken in 396.24.02 and 396.24.10 Vulkan beta drivers on Linux

Thanks! 396.45 appears to work fine here: both hevc_nvenc and deviceQuery are OK. Heads up to anyone running CUDA or ffmpeg with hevc_nvenc/h264_nvenc, though: if this makes it from the 396.45 beta through to a general release, it will break any distribution not using this kernel config, which appears to be more than one.
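
For anyone else wanting a quick sanity check after installing a new driver, something along these lines exercises both paths (assuming ffmpeg was built with NVENC support and the CUDA 9.2 samples are installed; file names are just examples):

ffmpeg -f lavfi -i testsrc2=duration=5:size=1280x720:rate=30 -c:v hevc_nvenc nvenc_test.mp4
cd ~/NVIDIA_CUDA-9.2_Samples/1_Utilities/deviceQuery && make && ./deviceQuery

If both finish without errors, NVENC and the CUDA/UVM stack are at least basically working.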

Hello.

(sorry for my bad English)

I recompiled the kernel with the NUMA options enabled, as suggested in post 20 by @xts.

zcat /proc/config.gz | grep -i numa

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_ACPI_NUMA=y
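
For reference, one way to enable those options on an existing config before rebuilding (not necessarily what I did; the kernel source path, the exact set of options, and the install steps vary per distribution):

cd /usr/src/linux
./scripts/config --enable CONFIG_NUMA --enable CONFIG_ACPI_NUMA --enable CONFIG_X86_64_ACPI_NUMA --enable CONFIG_NUMA_BALANCING
make olddefconfig
make -j$(nproc) && make modules_install install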

The NVIDIA-Linux-x86_64-396.45.run driver recognized cuda-toolkit-9.2.148.

edit: added a small screencast of driver 396.45 working with CUDA.

https://youtu.be/GFro9V2faUM

Thanks for listening.

# grep -i numa .config
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
CONFIG_NUMA=y
# CONFIG_AMD_NUMA is not set
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_USE_PERCPU_NUMA_NODE_ID=y
CONFIG_ACPI_NUMA=y

Still having the issue.

I’m still having the issue after enabling the NUMA-related kernel config settings. Is any additional configuration needed to set up NUMA aside from the kernel config? In dmesg I see a line that says “No NUMA configuration found.”

If I run lscpu now, I can see “NUMA node(s)” = 1.

I’ve noticed that nvidia-uvm is still not loaded on boot. I’ve been loading it manually with nvidia-modprobe -u -c 0 after boot, but I am still unable to get CUDA applications working with the latest driver versions. Thanks for any help.
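
As an aside, one way to have nvidia-uvm loaded automatically at boot on a systemd-based distribution is a modules-load.d entry, though this only automates the loading and does not fix the underlying CUDA breakage:

echo nvidia-uvm > /etc/modules-load.d/nvidia-uvm.conf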

Might be a CPU-related problem, i.e. if you are running AMD, some of the NUMA stuff might not be working? Just a guess. I tested with a Skylake/1070 and a Sandy Bridge/660 and both worked. Hints as to what may be going on with the unified memory breakage in the driver can be found here:

It’s pretty clear some of the ideas discussed there are being worked into the driver and released in the betas, some of which clearly work only in the dev hardware environment, not on widespread hardware/distributions.

Greetings,

Adding my voice here to say that CUDA/OpenCL is totally broken in the v396.xx drivers, while it works just fine with older ones (I have currently reverted to 390.77).

BOINC fails to detect the GPU with v396 and no amount of kernel tweaking (i.e. NUMA enabled or not) makes any difference.

Please, NVIDIA, fix the CUDA/OpenCL libraries in your driver!!!

Driver v396.51 is still broken as far as OpenCL/CUDA support is concerned: BOINC still fails to detect any usable GPU with it, while with v340.77 (or any older version) everything works like a charm…

Note that I do now have a kernel with NUMA support compiled in, to no avail.

I am attaching the two relevant nvidia-bug-report.log.gz files.
nvidia-340.77-bug-report.log.gz (103 KB)
nvidia-396.51-bug-report.log.gz (103 KB)

Attaching my .config which works with 396.45.
config.txt (107 KB)

Thanks, but that is irrelevant to my system. I have the exact same NUMA configuration as yours, and yet the v396 driver (after v396.24, which does work fine) fails to work as far as CUDA/OpenCL is concerned.

Something changed that broke that driver. NVIDIA needs to fix it!

_dinosaur, your log shows that you’re using persistence mode; check whether turning that off helps. From another thread I know that this now makes use of NUMA.
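
For reference, persistence mode can be toggled as root with nvidia-smi:

nvidia-smi -pm 0   # disable persistence mode
nvidia-smi -pm 1   # re-enable it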

In my case CONFIG_NUMA=y did the trick, even though dmesg says “No NUMA configuration found”.

Disabling persistence mode (which I do need anyway, in order to overclock the “application clocks”; see this article) does not change anything. The GPU is still not seen by BOINC with driver v396.45/51.
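
For context, this is the kind of application-clocks workflow I mean (the clock pair below is only a placeholder; query the supported values for your own card first):

nvidia-smi -q -d SUPPORTED_CLOCKS   # list valid memory,graphics clock pairs
nvidia-smi -pm 1                    # keep the driver loaded so the setting sticks
nvidia-smi -ac 3505,1506            # example pair only, adjust for your GPU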

.51 is still broken for me; I tried a bunch of variations on NUMA/HMM-related kernel configs, but no luck.
I’m going to look at a diff of the open-source portion of the driver between .24 and .51 next; hopefully something will jump out at me.
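
In case anyone wants to do the same, the open-source kernel interface can be unpacked from both .run installers and diffed directly (the directory names below are just what --extract-only produces by default):

sh NVIDIA-Linux-x86_64-396.24.run --extract-only
sh NVIDIA-Linux-x86_64-396.51.run --extract-only
diff -ru NVIDIA-Linux-x86_64-396.24/kernel NVIDIA-Linux-x86_64-396.51/kernel | less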

I will describe a potential solution, although the set of prerequisites for it to be successful is very specific.

I encountered this problem on one of my laptops with a GT 730M (with Optimus technology) on Ubuntu 16.04 with driver 396.51. The nvidia-uvm module wouldn’t load automatically, nvidia-modprobe wouldn’t load it either, and all CUDA calls failed with code 30, including a freshly compiled deviceQuery sample application.

The cause was very stupid. After installing bumblebee a while ago, I had blacklisted all the nvidia modules and safely forgotten about it, as everything worked well with the driver version I used back then. Usually bumblebee loads the nvidia modules on request: when you prefix a command with optirun, it launches OpenGL applications on the Nvidia GPU rather than on the Intel one. However, bumblebee has no clue about the nvidia_uvm module, which AFAIK is purely for compute, so it will never load it. And since the module is blacklisted, nvidia-modprobe fails to load it as well.

Removing the blacklisting of the nvidia_uvm module helped (nvidia and nvidia_modeset can be left blacklisted if you are using bumblebee), and CUDA is now fully functional on 396.51.

If you are reading this, you may find it useful to check all the .conf files in /etc/modprobe.d/ for patterns like blacklist nvidia*uvm, as well as to check your kernel command line and other means of blacklisting modules. Someone (possibly one of the authors of some super-duper package, or even you yourself in the past) may have broken the UVM module simply by blacklisting it. Remove the offending piece and reboot.
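
A quick way to check both places at once is something like:

grep -ri 'blacklist.*nvidia' /etc/modprobe.d/ /lib/modprobe.d/
cat /proc/cmdline   # look for modprobe.blacklist= or module_blacklist= entries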

Disclaimer: fiddling with kernel modules may break the display manager and much other useful stuff. If that happens, boot to runlevel 3 and revert the change. Fingers crossed.

I insmod nvidia-uvm manually, so blacklisting isn’t an issue (the module loads fine). I have also enabled the NUMA-related kernel options and verified via strace that this isn’t the issue. Versions higher than 396.24 don’t work for me.

Edit: I ran nvidia-bug-report.sh for the .24 and .51 drivers and compared the output; there is virtually no change in the logs (apart from timestamps and memory offsets).
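
For anyone who wants to repeat that comparison, roughly (the renamed output files are just examples):

sudo nvidia-bug-report.sh && mv nvidia-bug-report.log.gz report-396.24.log.gz   # repeat per installed driver version
diff <(gunzip -c report-396.24.log.gz) <(gunzip -c report-396.51.log.gz) | less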

I have two PCs with nearly identical Gigabyte GTX 1060s and Gentoo installs.

On my “modern” system (Ivy Bridge Xeon) I have 4.17.12+396.51 working after some jumping around with CONFIG_NUMA.

On my “vintage” system (Conroe Celeron 430), 4.17.14+396.51 does not work no matter how I juggle NUMA, HMM, UVM, nvidia-modprobe and the rest. However, 4.17.14+396.24 does work.

I don’t know why that is; I’m just leaving it here so that someone might get a hint out of it.

I have the same problem. Currently I am on Linux 4.18 and Nvidia 396.24. No earlier Nvidia driver allows other software, such as GIMP or ffmpeg, to use CUDA.

However, I am experiencing another twist. Even with Nvidia 396.24, CUDA does not ALWAYS work. To get CUDA working, I found that I have to run the following command:

modprobe -r nvidia-uvm && modprobe nvidia-uvm && clinfo

Here, clinfo is just one example. I also need to do this with GIMP and ffmpeg to get them to use CUDA.

But sometimes I need to execute the above command 2-3 times before CUDA will work.
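
One way to script those 2-3 retries is a small loop along these lines (run as root; the grep string assumes clinfo reports the NVIDIA platform as “NVIDIA CUDA”):

for i in 1 2 3; do
    modprobe -r nvidia-uvm
    modprobe nvidia-uvm
    clinfo | grep -q 'NVIDIA CUDA' && break
    sleep 1
done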

How do I know CUDA is working? I just run the following to see whether /dev/nvidia-uvm has been successfully opened by the program:

lsof | grep nvidia-uvm

So there seems to be a serious fault in the latest Nvidia drivers for Linux. I hope others can provide more feedback.

I have two systems running on Gentoo, a 1090T AMD with GTX780 and a Xeon E5-1620 with a Quadro K620. The AMD system required the NUMA flag to function again while the Xeon/K620 pair do not function. Oddly it did work once (using deviceQuery and CUDA_VISIBLE_DEVICES set) after the first cold boot with 4.18.5 but never since. I have to wonder if its the way EFI vars are being parsed as I had a similar issue just getting graphical output when I was booting a Xen kernel. The only other thing that comes to mind are the Spectre mitigations as more are enabled on Intel than AMD. In all cases, GCC 7.3.0 was used to compile.

Don’t know what happened to my other account, so I’m updating with my personal one. Either genkernel is doing something screwy or the NV driver doesn’t like DRBD. A ‘make distclean && make defconfig’ created a kernel with functional CUDA. I’m not rebooting this thing until our TensorFlow testing is done! :)

Isn’t genkernel using the compiler flags from make.conf? I’m not using those, just the kernel defaults, and CUDA works.