Please lift NUMA dependency of CUDA or provide a test for it in the installer and kernel module

Hello,

for reasons unknown to me, the CUDA library requires the presence of NUMA support in the Linux kernel.

The library tries to open /sys/devices/system/node/, which is part of the kernel’s NUMA support, and fails when that support has not been compiled into the kernel.

NUMA support is not compiled into every Linux kernel, because many machines do not need it, and as a result CUDA does not work on those machines.
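
For reference, checking whether a given kernel exposes this interface at all only takes a few lines; a minimal sketch in C (it simply tests for the sysfs path mentioned above):

/* numa_check.c - does the running kernel expose the sysfs NUMA topology
 * that the CUDA library tries to open?  Build: gcc numa_check.c */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;

    if (stat("/sys/devices/system/node", &st) == 0 && S_ISDIR(st.st_mode)) {
        printf("NUMA sysfs interface present (kernel has CONFIG_NUMA)\n");
        return 0;
    }
    printf("/sys/devices/system/node is missing -- kernel was likely built "
           "without CONFIG_NUMA\n");
    return 1;
}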

The Linux kernel configuration has the following to say about it:

  Enable NUMA (Non Uniform Memory Access) support.

  The kernel will try to allocate memory used by a CPU on the
  local memory controller of the CPU and add some more
  NUMA awareness to the kernel.

  For 64-bit this is recommended if the system is Intel Core i7
  (or later), AMD Opteron, or EM64T NUMA.

  For 32-bit this is only needed if you boot a 32-bit
  kernel on a 64-bit NUMA platform.

  Otherwise, you should say N.

It appears the recommended setting here is to say No for many machines.

I do not know since when CUDA has required NUMA, as I have only recently started using CUDA again, and for some time I could not figure out why some software that previously ran just fine (e.g. ffmpeg with NVENC on an AMD FX8350 with a GTX960) no longer works.

If NUMA is now a hard dependency, it would help if the installer provided a test for it. A warning from the nvidia kernel module would also be helpful.
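
As a rough illustration of the kind of warning I mean (a hypothetical sketch only, not actual driver code; the names and the message are made up), a compile-time check in the kernel module could look like this:

/* Illustrative only: warn at module load when the target kernel was
 * configured without NUMA support.  Built against the kernel headers
 * with the usual obj-m Makefile. */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>

static int __init numa_warn_init(void)
{
#ifndef CONFIG_NUMA
        pr_warn("nvidia: kernel built without CONFIG_NUMA; CUDA/OpenCL may fail to initialize\n");
#endif
        return 0;
}

static void __exit numa_warn_exit(void)
{
}

module_init(numa_warn_init);
module_exit(numa_warn_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Illustrative CONFIG_NUMA warning");

Since the module is built against the target kernel’s configuration, the #ifndef is resolved at build time, so such a check would cost nothing at run time.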

If NUMA is not essential, it would be nice if the dependency could be lifted so that one could again use Linux kernels without explicit NUMA support.

Cheers

Driver version is 396.51 for Linux x86_64, CUDA version is 9.2.88, kernel version is 4.17.12.
CPU is AMD FX8350, graphics card is Nvidia GTX960.

I can only second this plea, especially since on my systems, even with NUMA support fully enabled in the kernel, CUDA and OpenCL are broken and won’t work at all with drivers after v396.24 (which was the last working version for me).

Could we PLEASE get a reaction and/or explanation from an NVIDIA official on this crucial SHOW STOPPER???

Hi dinosaur_,

The CONFIG_NUMA requirement wasn’t intentional and is being tracked in bug 2316155.

I checked your bug report from the other thread but I don’t see anything obviously wrong and I haven’t been able to reproduce a similar problem myself when CONFIG_NUMA is enabled.

Is there a relevant error log from BOINC showing the problem?

Good to hear, thank you!

No error, just “No GPU found” in the BOINC log… I also checked with ldd that no library was missing (though I suppose the CUDA/OpenCL libraries are loaded by BOINC itself at run time, not by the dynamic linker at start-up), and nothing wrong was detected. Nevertheless, could it be related to the change in the libraries’ major version numbers that occurred in v396.45?
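
Assuming BOINC does dlopen() the CUDA/OpenCL libraries at run time (as supposed above), ldd on the binary would not list them anyway; a tiny check like the following (just a sketch) at least confirms they are loadable on the system:

/* dlopen_check.c - can the driver's user-space libraries be loaded at
 * run time the way a dlopen()ing application would load them?
 * Build: gcc dlopen_check.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

static void try_open(const char *name)
{
    void *h = dlopen(name, RTLD_NOW);

    if (h) {
        printf("%s: OK\n", name);
        dlclose(h);
    } else {
        printf("%s: FAILED (%s)\n", name, dlerror());
    }
}

int main(void)
{
    try_open("libcuda.so.1");
    try_open("libOpenCL.so.1");
    return 0;
}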

A couple more things to note:

  • It happens on the two systems I run BOINC on (one with a Core i5-2500K and a GTX970 on an ASRock Z68 Extreme4 Gen3 motherboard, and one with a Core 2 Quad Q6600 and a GTX660 on a GIGABYTE GA-EX38-DS4 motherboard).
  • When BOINC starts, the nvidia_uvm module does not get loaded, but loading it manually and restarting BOINC makes no difference.
  • Using persistence (with or without nvidia-persistenced) or not does not change anything.

What exactly is the error from BOINC? Do other CUDA apps work?

I just tried BOINC on my system and it failed to detect the GPU when I ran it as “boinc”, but running it as “/usr/bin/boinc” worked. There’s some sort of bug in BOINC’s GPU detection that makes it fail if it’s not launched with an absolute path. Could that be the problem for you?

I’m using an old BOINC version (7.0.65) that never had any trouble detecting NVIDIA GPUs (I used it over the years with an 8800GT, a GTX460, a GTX660 and a GTX970). Newer versions are indeed utterly buggy and often fail to detect NVIDIA GPUs…
That said, since BOINC v7.0.65 works with all drivers up to and including v396.24 and fails with the two newest driver versions, this has something to do with your driver, not with BOINC…

As for other CUDA apps, I don’t have any, but see this thread and how others besides me (e.g. tpruzina and alnash) also have problems with other CUDA apps…

The bug is in your driver, period.

Bug still present in v396.54. :-(

I also get the bug. Enabling NUMA doesn’t help.

Error from john the ripper: Error: Failure in clGetPlatformIDs, error code=-1001
Error from mpv: [ffmpeg] AVHWDeviceContext: Could not initialize the CUDA driver API
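
For what it’s worth, the -1001 from clGetPlatformIDs is CL_PLATFORM_NOT_FOUND_KHR, i.e. the ICD loader found no usable platform. The quickest way to see the CUDA-side failure directly, without mpv or john, is a bare cuInit() call; a minimal sketch, assuming the CUDA toolkit headers are installed:

/* cuinit_check.c - call the CUDA driver API directly and print the result.
 * Build: gcc cuinit_check.c -I/usr/local/cuda/include -lcuda
 * (adjust the include path to wherever the toolkit is installed) */
#include <stdio.h>
#include <cuda.h>

int main(void)
{
    const char *msg = NULL;
    CUresult rc = cuInit(0);

    if (rc != CUDA_SUCCESS) {
        cuGetErrorString(rc, &msg);
        fprintf(stderr, "cuInit failed: %d (%s)\n", (int)rc, msg ? msg : "unknown");
        return 1;
    }
    printf("cuInit succeeded\n");
    return 0;
}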

If I downgrade to 396.24 everything works.

See https://devtalk.nvidia.com/default/topic/1037521/linux/cuda-broken-in-396-24-02-and-396-24-10-vulkan-beta-drivers-on-linux/3

Try enabling the CGROUP options, specifically (and at least) CONFIG_CPUSETS, along with NUMA.
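
To see what the running kernel actually provides before rebuilding, the cpuset part can be probed from userspace; a sketch (/proc/cgroups lists the cpuset controller only when CONFIG_CPUSETS is enabled):

/* cpuset_check.c - is the cpuset cgroup controller available?
 * Build: gcc cpuset_check.c */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/cgroups", "r");

    if (!f) {
        perror("/proc/cgroups");   /* no cgroup support at all */
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "cpuset", 6) == 0) {
            printf("cpuset controller present (CONFIG_CPUSETS enabled)\n");
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    printf("cpuset controller not listed -- CONFIG_CPUSETS appears to be disabled\n");
    return 1;
}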

This showstopper bug (CONFIG_NUMA and CONFIG_CPUSETS requirements) still plagues the newest 410.57 driver.

When will NVIDIA condescend to fix this bug?

Apparently, this regression was fixed in v410.66, despite not being listed as fixed in the change log.

So it took 3 months for NVIDIA to fix that showstopper. I have seen better responsiveness… Still, thank you, I guess.