prime-select won't enable use of discrete GPU when booted into intel mode

On a previous machine, I was able to boot into intel mode (i.e. prime-select intel, reboot) and then, in a terminal, run prime-select nvidia and utilize the Nvidia GPU (specifically with nvidia-docker). This allowed me to use the integrated graphics for my display and freed up the GPU for the processing I cared about. I just upgraded to a new laptop, and now if I boot into intel mode, I can’t figure out how to run processing jobs on the NVIDIA GPU. After running prime-select nvidia in a terminal, I have been unable to successfully run nvidia-smi or start a docker container with the nvidia runtime. I run with the built-in display as well as dual external monitors, and between the monitors and an open web browser, almost all of the GPU memory is consumed when I boot into NVIDIA mode.
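For concreteness, the workflow that used to work on the old machine was roughly the following (the exact docker image is illustrative, not taken from the thread):

```shell
# Boot with the integrated GPU driving the display
sudo prime-select intel
sudo reboot

# After reboot, switch to the nvidia driver/library paths in a terminal,
# then use the discrete GPU for compute only
sudo prime-select nvidia
nvidia-smi   # should now list the discrete GPU with (almost) all memory free

# Illustrative compute job via nvidia-docker / the nvidia runtime
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
```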

The previous configuration was

  • Dell Precision 7720
  • NVIDIA P5000 with 384 driver
  • Ubuntu 16.04

The new configuration is

  • Dell Precision 7740
  • NVIDIA RTX 5000 with 418.56 driver
  • Ubuntu 16.04

The 418.56 driver was installed via apt-get and was the only driver I could get to work. All of the other 418 drivers I downloaded from the NVIDIA site resulted in login loops, as did all of the 430 drivers, including the one installed via apt-get.

In general, the nvidia card has to be turned on when booted into intel mode:

sudo tee /proc/acpi/bbswitch <<<ON

After that, the nvidia modules have to be loaded, but I guess they’re blacklisted.

I’m pretty sure that the nvidia modules are not blacklisted. There are nvidia entries in /etc/modprobe.d, but they refer to old versions, nvidia-418-updates, and things like that. I tried the bbswitch command above, but it didn’t seem to have an effect. What is the official/best process for running something on the GPU while booted into intel mode? Should I just be telling bbswitch to turn on, or should I be doing that in addition to prime-select? And if so, in which order? I also tried installing bumblebee and configuring it based on instructions I found for 418 and 16.04, but it broke lightdm and I had to uninstall it.

The prime-select stuff is Ubuntu specific, so there’s no “official” way. As a sidenote, 16.04 isn’t really the best option for this, since prime-select also switches library paths, which is why you have to run prime-select nvidia to get access to nvidia-smi etc.
So you would have to try:

sudo tee /proc/acpi/bbswitch <<<ON
sudo prime-select nvidia
sudo modprobe nvidia
nvidia-smi

This should work if the modules are not blacklisted. To check:

grep blacklist /etc/modprobe.d/* /lib/modprobe.d/*
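If the grep comes back clean but modprobe still fails, it may also be worth verifying (my suggestion, not from the thread) that module files for the installed driver actually exist for the running kernel:

```shell
# List the nvidia kernel modules built for the currently running kernel
find /lib/modules/"$(uname -r)" -name 'nvidia*.ko' 2>/dev/null

# Show which module file "modprobe nvidia" would actually load
modinfo -n nvidia
```

If modinfo resolves to a module from a different driver version than the one installed via apt-get, that would explain “Unknown symbol in module” errors.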

A more advanced option would be to switch to Ubuntu 18.04 and use the new render offload feature:
http://download.nvidia.com/XFree86/Linux-x86_64/435.17/README/primerenderoffload.html

Otherwise, you could also configure 18.04 for fixed intel graphics with nvidia compute:
https://devtalk.nvidia.com/default/topic/1043405/linux/ubuntu-18-04-headless_390-intel-igpu-after-prime-select-intel-lost-contact-to-geforce-1050ti/post/5293003/#5293003
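With render offload, the desktop stays on the integrated GPU and individual applications are pushed onto the discrete GPU via environment variables; per the linked README, the relevant variables are __NV_PRIME_RENDER_OFFLOAD and __GLX_VENDOR_LIBRARY_NAME:

```shell
# Run a single GLX application on the NVIDIA GPU while the desktop
# keeps rendering on the integrated GPU
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep vendor
```

If the offload is working, the vendor strings should report NVIDIA rather than Intel.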

For various reasons, our organization is not allowed to do official processing using 18.04, so that is not an option for me.
Those commands don’t work. They used to work on my previous laptop, which had a different GPU, driver, and firmware. Running those commands immediately after booting into intel mode results in

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Running

lsmod |grep nvidia

printed nothing.

I ran

grep nvidia /etc/modprobe.d/* /lib/modprobe.d/* |grep blacklist

and got

/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
/etc/modprobe.d/nvidia-graphics-drivers.conf:blacklist nvidia-current
/etc/modprobe.d/nvidia-graphics-drivers.conf:blacklist nvidia-173
/etc/modprobe.d/nvidia-graphics-drivers.conf:blacklist nvidia-96
/etc/modprobe.d/nvidia-graphics-drivers.conf:blacklist nvidia-current-updates
/etc/modprobe.d/nvidia-graphics-drivers.conf:blacklist nvidia-173-updates
/etc/modprobe.d/nvidia-graphics-drivers.conf:blacklist nvidia-96-updates
/etc/modprobe.d/nvidia-graphics-drivers.conf:blacklist nvidia-418-updates

Running

sudo modprobe nvidia-uvm

(which used to work just fine on my old system) results in

modprobe: ERROR: could not insert 'nvidia_418_uvm': Unknown symbol in module, or unknown parameter (see dmesg)

and the most recent output of dmesg is

[  661.322380] nvidia: probe of 0000:01:00.0 failed with error -1
[  661.322394] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  661.322394] NVRM: None of the NVIDIA graphics adapters were initialized!
[  661.322510] nvidia-nvlink: Unregistered the Nvlink Core, major device number 238
[  661.499398] PKCS#7 signature not signed with a trusted key
[  661.508888] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[  661.509187] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[  661.509213] NVRM: The NVIDIA GPU 0000:01:00.0
               NVRM: (PCI ID: 10de:1eb5) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[  661.509222] nvidia: probe of 0000:01:00.0 failed with error -1
[  661.509235] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  661.509236] NVRM: None of the NVIDIA graphics adapters were initialized!
[  661.509342] nvidia-nvlink: Unregistered the Nvlink Core, major device number 238
[  661.665045] PKCS#7 signature not signed with a trusted key
[  661.675340] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[  661.675627] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
[  661.675651] NVRM: The NVIDIA GPU 0000:01:00.0
               NVRM: (PCI ID: 10de:1eb5) installed in this system has
               NVRM: fallen off the bus and is not responding to commands.
[  661.675660] nvidia: probe of 0000:01:00.0 failed with error -1
[  661.675672] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  661.675673] NVRM: None of the NVIDIA graphics adapters were initialized!
[  661.675771] nvidia-nvlink: Unregistered the Nvlink Core, major device number 238
[  661.890853] PKCS#7 signature not signed with a trusted key
[  661.896110] PKCS#7 signature not signed with a trusted key

The messages point to the gpu still being turned off. What’s the dmesg output after trying to turn it on using bbswitch?
Please post the output of
sudo dmesg |grep bbswitch
You might be hitting an acpi/pci bug with your new machine that leads to the gpu not being able to power on again once powered off. To circumvent this, just blacklist the bbswitch module.
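To spell out the blacklisting step (standard modprobe.d mechanics; the file name is arbitrary):

```shell
# Blacklist bbswitch so it is not auto-loaded at boot
echo "blacklist bbswitch" | sudo tee /etc/modprobe.d/blacklist-bbswitch.conf

# Rebuild the initramfs so the blacklist takes effect early, then reboot
sudo update-initramfs -u
sudo reboot
```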

sudo dmesg |grep bbswitch results in no output.
I tried to add

blacklist bbswitch

to /etc/modprobe.d/blacklist.conf

and ran

sudo update-initramfs -u

and rebooted, but

sudo lsmod |grep bbswitch

gave me

bbswitch               16384  0

I did find the following group of messages in syslog:

Aug 28 08:54:36 Precision-7740 kernel: [   18.014192] bbswitch: version 0.8
Aug 28 08:54:36 Precision-7740 kernel: [   18.014207] input: HDA Intel PCH HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input18
Aug 28 08:54:36 Precision-7740 kernel: [   18.014252] input: HDA Intel PCH HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input19
Aug 28 08:54:36 Precision-7740 kernel: [   18.014276] bbswitch: Found integrated VGA device 0000:00:02.0: \_SB_.PCI0.GFX0
Aug 28 08:54:36 Precision-7740 kernel: [   18.014294] input: HDA Intel PCH HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input20
Aug 28 08:54:36 Precision-7740 kernel: [   18.014463] input: HDA Intel PCH HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:1f.3/sound/card0/input21
Aug 28 08:54:36 Precision-7740 kernel: [   18.014645] bbswitch: Found discrete VGA device 0000:01:00.0: \_SB_.PCI0.PEG0.PEGP
Aug 28 08:54:36 Precision-7740 kernel: [   18.014663] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20170831/nsarguments-100)

FWIW, in this state, /proc/acpi/bbswitch says

0000:01:00.0 ON

and it appears to indicate that the card is being turned on and off, but I still get the same messages in dmesg.

Try using the nogpumanager kernel parameter. A modprobe blacklist only prevents automatic loading; as I understand it, Ubuntu’s gpu-manager loads bbswitch explicitly at boot, which would explain why it still shows up in lsmod despite the blacklist.
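For reference, kernel parameters on Ubuntu are typically added via GRUB (standard mechanics, not specific to this thread):

```shell
# Edit /etc/default/grub and append nogpumanager to the default command line, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nogpumanager"
sudo nano /etc/default/grub

# Regenerate the GRUB configuration and reboot for it to take effect
sudo update-grub
sudo reboot

# After reboot, confirm the parameter is active
cat /proc/cmdline
```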