375.10 - bad experience

Not really sorry if I appear somewhat grumpy. After 8 hours of fixing the mess that is NVIDIA-Linux-x86_64-375.10.run I believe I can.

I had enough energy left to register in this forum to vent my frustration, so while I have read
https://devtalk.nvidia.com/default/topic/522835/linux/if-you-have-a-problem-please-read-this-first/
and while I do have an nvidia-bug-report.log (which I had to anonymize first), I see no “attach” link.

First things 1st: Lenovo P50 with a Quadro M2000M and a Skylake Xeon E3-1505 (which has an integrated P530)

I really would like/demand an explanation for this:

[ 4186.788284] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:13b0)
               NVRM: installed in this system is not supported by the 375.10
               NVRM: NVIDIA Linux driver release.  Please see 'Appendix
               NVRM: A - Supported NVIDIA GPU Products' in this release's
               NVRM: README, available on the Linux driver download page
               NVRM: at www.nvidia.com.

Because - and I won’t quote the Appendix A of the README - the M2000M is supported of course. Hell - the error message even shows the exact PCI Id.

[ 4186.788305] nvidia: probe of 0000:01:00.0 failed with error -1
[ 4186.788369] nvidia-nvlink: Nvlink Core is being initialized, major device number 247
[ 4186.788402] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 4186.788404] NVRM: None of the NVIDIA graphics adapters were initialized!
[ 4186.788407] nvidia-nvlink: Unregistered the Nvlink Core, major device number 247
[ 4186.788630] NVRM: NVIDIA init module failed!
[ 4187.163800] vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=none,decodes=none:owns=none
[ 4187.163838] NVRM: The NVIDIA GPU 0000:01:00.0 (PCI ID: 10de:13b0)

As I said - it took me about 8 hours to get a working system, so I could at least write this little outcry. Of course, I have lots of highly qualified feedback about serious pitfalls with the Nvidia driver, a hybrid configuration and Linux Kernel > 4.7 (currently I’m on 4.8.4).

So in case you’re interested, let’s start a discussion. Maybe tomorrow I’ll be even constructive.

nvidia-bug-report.log.gz (48.7 KB)

Hi gpuxplorer.

I’m sorry you had a bad experience.

The forum’s file attach link is a little hard to find. You have to post your message first and then hover over it with the mouse and click the small green paper clip icon that pops up. If you’re concerned about the contents of your bug report being public, you can email it to linux-bugs@nvidia.com. If you’re still concerned, my PGP key is F56ACC8F09BA9635.

When this particular error shows up for a GPU that really should be supported, it’s typically because some other system component (typically the “bbswitch” module) cut power to the GPU or otherwise attempted to save power in a way that interfered with communication between the CPU and the GPU. I’d recommend looking for something like that and disabling it. I filed bug #1834157 internally to try to detect this better and provide a better error message.
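
A quick way to rule this out is to look at the kernel's list of loaded modules. The following is only a minimal sketch: it assumes the module would appear under the name `bbswitch` in /proc/modules, and the helper names `is_module_loaded` and `check_bbswitch` are made up for this example.

```python
from pathlib import Path


def is_module_loaded(proc_modules_text: str, name: str) -> bool:
    """Return True if `name` is the first field of a line in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0] == name:
            return True
    return False


def check_bbswitch(path: str = "/proc/modules") -> bool:
    """Read the live module list (Linux only) and report whether bbswitch is loaded."""
    return is_module_loaded(Path(path).read_text(), "bbswitch")
```

Calling `check_bbswitch()` on the affected machine returns `True` only if a module with exactly that name is loaded; it says nothing about other power-saving mechanisms.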

Thanks Aaron for the follow-up.

Ok, attachment uploaded. I had to remove some company-specific contents from PATH.

In general, my humble but educated opinion about providing drivers would be:

Be more up-to-date! I believe we are talking about development here, not the gamers. Maybe it’s just me, but I assume for software development people do not use distributions lagging way behind the current software versions.

The current Linux Kernel is 4.8.5, the current GCC is 6.2.x

Sure, for production the software is being tested on Debian, Ubuntu and other Dinosaurs, but that doesn’t help if the drivers do not integrate well in your development environment.
It cannot be that hard to have an automated build/testing/release process that spills out -NIGHTLY- - can it? May I help?


There was no bbswitch, bumblebee or similar present on the computer. My intention is to get a working heterogeneous OpenCL development environment, where I have

  • A Skylake CPU platform
  • A P530 GPU platform
  • An Nvidia Quadro M2000M platform

I want to use the Nvidia solely for computing purposes, so I believe I do not even need any fancy switching stuff.


Because I have a job to get done, I set up another development environment as there was no way to get Nvidia OpenCL integrated in my Gentoo. So I’m running Arch now - guess what: a 4.8.4 kernel, which is the default there, and gcc 6.2 (which was even masked in Gentoo).

Because Arch brings its own gcc5 dependency and binaries take some burden, 370.28 now runs on 4.8.4 here. Basically mission accomplished, I’d still leave the thread open as it addresses the more basic problem of integration of Nvidia software into Linux distros. More feedback:


The 375 nvidia installer complains about a running X server, although that X server runs on the P530 Skylake GPU here. I’m pretty sure it could ignore it altogether, so I turned that check off and voila! it went through. It would be nice if the installer script was a little bit smarter about such a situation (is the X server running on a device we actually need?)


Since Kernel 4.7 you can trim unused Kernel symbols. See

https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.7-Trim-Unused-KSYMS

Under no circumstances enable the CONFIG_TRIM_UNUSED_KSYMS option in the kernel if you want to install the external drivers. Naturally, the kernel does not know about the Nvidia drivers to come from an external source (and to need some symbols the kernel itself may not see the need for).

So all of a sudden, the nvidia driver sees no VGA Arbiter, no set_cpufreq and about 40 other symbols it needs.
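
To check a kernel before installing the driver, you can scan its config for that option. This is just a sketch under assumptions: the helper names `option_enabled` and `read_kernel_config` are invented here, and /proc/config.gz only exists if the kernel was built with CONFIG_IKCONFIG_PROC (otherwise point it at wherever your distro keeps the config file).

```python
import gzip
from pathlib import Path


def option_enabled(config_text: str, option: str) -> bool:
    """Return True if the kernel config text contains `<option>=y`."""
    for line in config_text.splitlines():
        # Disabled options appear as "# CONFIG_FOO is not set", which never matches.
        if line.strip() == f"{option}=y":
            return True
    return False


def read_kernel_config(path: str) -> str:
    """Read a kernel config file, transparently handling gzip (e.g. /proc/config.gz)."""
    p = Path(path)
    if p.suffix == ".gz":
        with gzip.open(p, "rt") as fh:
            return fh.read()
    return p.read_text()
```

For example, `option_enabled(read_kernel_config("/proc/config.gz"), "CONFIG_TRIM_UNUSED_KSYMS")` should be `False` on a kernel that is meant to load external modules.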

EOT

You made the mess yourself!

You have mixed nvidia versions installed and you expect the beta driver to work after this!

[  3581.687] (II) NVIDIA GLX Module  375.10  Fri Oct 14 10:01:22 PDT 2016
[  3581.687] (II) LoadModule: "intel"
[  3581.687] (II) Loading /usr/lib64/xorg/modules/drivers/intel_drv.so
[  3581.687] (II) Module intel: vendor="X.Org Foundation"
[  3581.687] 	compiled for 1.18.4, module version = 2.99.917
[  3581.687] 	Module class: X.Org Video Driver
[  3581.687] 	ABI class: X.Org Video Driver, version 20.0
[  3581.687] (II) LoadModule: "nvidia"
[  3581.688] (II) Loading /usr/lib64/xorg/modules/drivers/nvidia_drv.so
[  3581.688] (II) Module nvidia: vendor="NVIDIA Corporation"
[  3581.688] 	compiled for 4.0.2, module version = 1.0.0
[  3581.688] 	Module class: X.Org Video Driver
[  3581.688] (II) intel: Driver for Intel(R) Integrated Graphics Chipsets:
	i810, i810-dc100, i810e, i815, i830M, 845G, 854, 852GM/855GM, 865G,
	915G, E7221 (i915), 915GM, 945G, 945GM, 945GME, Pineview GM,
	Pineview G, 965G, G35, 965Q, 946GZ, 965GM, 965GME/GLE, G33, Q35, Q33,
	GM45, 4 Series, G45/G43, Q45/Q43, G41, B43
[  3581.689] (II) intel: Driver for Intel(R) HD Graphics: 2000-6000
[  3581.689] (II) intel: Driver for Intel(R) Iris(TM) Graphics: 5100, 6100
[  3581.689] (II) intel: Driver for Intel(R) Iris(TM) Pro Graphics: 5200, 6200, P6300
[  3581.689] (II) NVIDIA dlloader X Driver  370.28  Thu Sep  1 18:51:40 PDT 2016
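
A mismatch like this can be spotted mechanically by pulling the version numbers out of the Xorg log. The sketch below is illustrative only: the function names are made up, and the pattern assumes the two banner lines look like the "NVIDIA GLX Module" and "NVIDIA dlloader X Driver" lines in this log, with the version number following the component name.

```python
import re

# Matches e.g. "NVIDIA GLX Module  375.10" or "NVIDIA dlloader X Driver  370.28".
_NVIDIA_VERSION = re.compile(
    r"NVIDIA (GLX Module|dlloader X Driver)\s+(\d+(?:\.\d+)+)"
)


def nvidia_component_versions(log_text: str) -> dict:
    """Map each NVIDIA component found in the log to the version it reported."""
    versions = {}
    for component, version in _NVIDIA_VERSION.findall(log_text):
        versions[component] = version
    return versions


def has_version_mismatch(log_text: str) -> bool:
    """Return True if the components in the log report more than one version."""
    return len(set(nvidia_component_versions(log_text).values())) > 1
```

Feeding it the contents of the Xorg log shows which component reports which version; more than one distinct version means leftovers from an older install are still being loaded.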

I see.

Still not sure how the term “yourself” applies, as I cannot remember having written the NVIDIA-Linux-x86_64-375.10.run script.

If you meant I made the mess by executing that script on my former distro: you’re right.

If you had run the old .run file first to uninstall, you wouldn’t have gotten the leftovers.
375.10 is a beta driver so expect a few issues.

Hi!

Let me add to the picture that this is also fully reproducible without bbswitch (or any special tool) as I have described in https://devtalk.nvidia.com/default/topic/971733/linux/-370-28-with-kernel-4-8-on-gt-2015-machines-driver-claims-card-not-supported-if-nvidia-is-not-primary-card/ with a vanilla 4.8 kernel.

In short, you need:

  • Kernel >= 4.8
  • PCI runtime power management enabled (e.g. by laptop mode tools). I’m not sure whether this point is a hard requirement.
  • a BIOS released >= 2015.
    In that case, starting from kernel 4.8, “pcie_port_pm” will be active by default.
    If the nvidia card is not actively driving any output from the very beginning, e.g. by driving the primary screen via efifb or whatever, the kernel will power down the PCIe link.

The nvidia driver (at least as of 370.28) is unable to detect this situation and power it up again and complains with this not-so-helpful message.

As a quick test of whether this is the issue, you can try booting with “pcie_port_pm=off” to force-disable power saving (which was the default pre-4.8).
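
To verify that the parameter actually made it onto the kernel command line after a reboot, something like the following could be used. It is only a sketch: the helper names are hypothetical, and the sysfs paths are the generic per-device runtime PM files, which I assume are present for the GPU's PCI address on the machine in question.

```python
from typing import Optional


def pcie_port_pm_setting(cmdline: str) -> Optional[str]:
    """Return the value of pcie_port_pm= on a kernel command line, or None if absent.

    If the parameter is given several times, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == "pcie_port_pm" and sep:
            value = val
    return value


def runtime_pm_files(pci_address: str) -> dict:
    """Return the sysfs paths describing runtime PM for one PCI device."""
    base = f"/sys/bus/pci/devices/{pci_address}/power"
    return {"control": f"{base}/control", "runtime_status": f"{base}/runtime_status"}
```

Pass it the contents of /proc/cmdline; a result of `"off"` means the workaround is active. The files returned by `runtime_pm_files("0000:01:00.0")` can then be read to see whether the device is still allowed to runtime-suspend.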

I have not yet tested the most recent beta release, maybe that fixes it…

Hi All,
We are trying to reproduce this issue.

  • Can we get detailed reproduction steps?
  • Also, can we get an nvidia bug report from the system on which the issue reproduces?
  • What is the model of the system on which the issue reproduces?
  • Does the OS need to be installed in UEFI mode?
  • Does the issue reproduce on other distros like Ubuntu, Fedora etc., or is it specific to Gentoo?
  • Is the kernel built with any custom config?

Hi @sandipt,

just in case it was missed, I have already provided all the information here:
https://devtalk.nvidia.com/default/topic/971733/linux/-370-28-with-kernel-4-8-on-gt-2015-machines-driver-claims-card-not-supported-if-nvidia-is-not-primary-card/post/5014713/#5014713

Cheers,
Oliver

Tracking this issue under: 1835588. I think it is good to keep only one thread for this issue.

https://devtalk.nvidia.com/default/topic/971733/linux/-370-28-with-kernel-4-8-on-gt-2015-machines-driver-claims-card-not-supported-if-nvidia-is-not-primary-card/