Cannot get nvidia-smi to work with 1050 and Ubuntu 18.04

(Bug report, installer log attached)

Hi, I am following the official NVIDIA install guide to the letter. I am installing from the local .deb file named in the instructions (cuda-repo-ubuntu1804-10-1-local-10.1.168-418.67_1.0-1_amd64.deb).

Based on my system, NVIDIA recommends CUDA 10.1 and nvidia-driver-415. I am not able to get nvidia-smi to recognize my card.
I’ve gone through many past threads on similar issues but haven’t made any progress.

Details:

I currently have both the 1050 and the AMD card installed. Do I need to remove the AMD?

lspci -v | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1) (prog-if 00 [VGA controller])
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV610 [Radeon HD 2400 PRO] (prog-if 00 [VGA controller])
cat /proc/driver/nvidia/gpus/0000\:01\:00.0/information
Model: 		 Unknown
IRQ:   		 30
GPU UUID: 	 GPU-????????-????-????-????-????????????
Video BIOS: 	 ??.??.??.??.??
Bus Type: 	 PCIe
DMA Size: 	 47 bits
DMA Mask: 	 0x7fffffffffff
Bus Location: 	 0000:01:00.0
Device Minor: 	 0
Blacklisted:	 No
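
The `Model: Unknown` and placeholder UUID above are consistent with the driver having registered the PCI device without managing to initialize it. A minimal sketch for printing that field for every GPU the driver knows about; `gpu_model_status` is a made-up helper name, and the directory argument is only there so it can be pointed at a copy of the files:

```shell
# Hypothetical helper (not part of any NVIDIA tool): print each GPU's bus
# address and the "Model" field from the driver's per-GPU information file.
# $1: directory to scan (defaults to /proc/driver/nvidia/gpus)
gpu_model_status() {
    dir="${1:-/proc/driver/nvidia/gpus}"
    for info in "$dir"/*/information; do
        [ -f "$info" ] || continue
        # The field looks like "Model: <tabs/spaces> Unknown"
        model=$(sed -n 's/^Model:[[:space:]]*//p' "$info")
        printf '%s %s\n' "$(basename "$(dirname "$info")")" "$model"
    done
}
```

Calling `gpu_model_status` with no argument reads the live /proc files; a healthy card should show its product name instead of `Unknown`.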

List of drivers:

ubuntu-drivers list
nvidia-driver-430
nvidia-driver-410
nvidia-driver-415
nvidia-driver-396
nvidia-driver-390
nvidia-driver-418
nvidia-396
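
For picking one entry out of a list like the one above, here is a small sketch that returns the highest-numbered `nvidia-driver-NNN` package. The helper name `newest_nvidia_driver` is made up, and "newest" is only an assumption about what you want; it is not necessarily the version any guide recommends for a given card.

```shell
# Hypothetical helper: read package names on stdin (one per line) and print
# the nvidia-driver-NNN entry with the highest number.
# Older-style names such as "nvidia-396" are deliberately ignored.
newest_nvidia_driver() {
    grep -E '^nvidia-driver-[0-9]+$' | sort -t- -k3,3n | tail -n 1
}
```

Usage would be `ubuntu-drivers list | newest_nvidia_driver`.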

This is what I see in dmesg:

[   20.117226] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  418.67  Sat Apr  6 03:07:24 CDT 2019
[   20.718025] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  418.67  Sat Apr  6 02:43:09 CDT 2019
[   20.949845] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   20.949848] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[   21.625260] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 238
[   22.294245] resource sanity check: requesting [mem 0x000e0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000effff window]
[   22.294729] caller os_map_kernel_space.part.6+0x6d/0x80 [nvidia] mapping multiple BARs
[   26.292785] NVRM: RmInitAdapter failed! (0x31:0xffff:834)
[   26.292839] NVRM: rm_init_adapter failed for device bearing minor number 0
[   36.532230] IPv6: ADDRCONF(NETDEV_UP): enp8s0: link is not ready
[   38.115894] tg3 0000:08:00.0 enp8s0: Link is up at 100 Mbps, full duplex
[   38.115900] tg3 0000:08:00.0 enp8s0: Flow control is on for TX and on for RX
[   38.115917] IPv6: ADDRCONF(NETDEV_CHANGE): enp8s0: link becomes ready
[   42.158019] new mount options do not match the existing superblock, will be ignored
[   43.417006] kauditd_printk_skb: 10 callbacks suppressed
[   43.417008] audit: type=1400 audit(1560872456.370:22): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/sys/devices/system/node/" pid=1030 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[   44.104129] audit: type=1400 audit(1560872457.054:23): apparmor="DENIED" operation="capable" profile="/usr/sbin/mysqld" pid=1030 comm="mysqld" capability=2  capname="dac_read_search"
[   45.279800] audit: type=1400 audit(1560872458.226:24): apparmor="DENIED" operation="open" profile="/usr/sbin/mysqld" name="/sys/devices/system/node/" pid=1202 comm="mysqld" requested_mask="r" denied_mask="r" fsuid=111 ouid=0

[  101.418767] NVRM: RmInitAdapter failed! (0x31:0xffff:834)
[  101.418825] NVRM: rm_init_adapter failed for device bearing minor number 0
[  113.692164] NVRM: RmInitAdapter failed! (0x31:0xffff:834)
[  113.692212] NVRM: rm_init_adapter failed for device bearing minor number 0
[  113.821683] NVRM: RmInitAdapter failed! (0x31:0xffff:834)
[  113.821768] NVRM: rm_init_adapter failed for device bearing minor number 0
[  113.938213] NVRM: RmInitAdapter failed! (0x31:0xffff:834)
[  113.939084] NVRM: rm_init_adapter failed for device bearing minor number 0
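
The line that matters in all of this is `NVRM: RmInitAdapter failed!`; the network and apparmor messages in between are unrelated. A minimal sketch for counting just those failures in kernel log text (the helper name `count_rm_init_failures` is made up for illustration):

```shell
# Hypothetical helper: count "RmInitAdapter failed" lines in kernel log
# text read from stdin. grep -c prints 0 but exits non-zero when nothing
# matches, so "|| true" keeps the function's exit status clean.
count_rm_init_failures() {
    grep -c 'NVRM: RmInitAdapter failed' || true
}
```

Usage would be `dmesg | count_rm_init_failures` (some systems restrict dmesg to root). A count of 0 after a reboot would mean the adapter initialized.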

Output of nvidia-smi:

nvidia-smi
No devices were found
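
For scripting the same check, `nvidia-smi -L` prints one `GPU N: ...` line per detected card. A small sketch that counts those lines; `count_listed_gpus` is a made-up helper name, and it only looks at the text it is given:

```shell
# Hypothetical helper: count the "GPU N: ..." lines in `nvidia-smi -L`
# output read from stdin. "No devices were found" yields 0.
count_listed_gpus() {
    grep -c '^GPU [0-9]' || true
}
```

Usage would be `nvidia-smi -L 2>&1 | count_listed_gpus`.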

Kernel info:

uname -a
Linux homeserver 4.15.0-51-generic #55-Ubuntu SMP Wed May 15 14:27:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I’ve rebooted several times as well.

Would appreciate some help. The CUDA samples compile and the CUDA install is detected, but deviceQuery fails when actually run.

nvidia-installer.log (29 KB)
nvidia-bug-report.log.gz (665 KB)

Might be a hardware failure. Upgrade the system BIOS, reseat the card, and check it in another system.

I’ve also updated my bug report and installer logs.
I unfortunately don’t have another system to try on. I’ll try and check the BIOS when I get home today.

In the meantime, do you think I need to remove my AMD?

EDIT: It doesn’t look like my BIOS has any updates (last one was 2012)
I know this system is compatible, hardware wise with the 1050 Ti (source)

The radeon shouldn’t matter, you can remove it for testing, of course. My guess is either the 1050 is broken or there’s some bios incompatibility, though.

More updates:
a) I’ve tried flashing the BIOS from A04 to A11 (A11 is the latest firmware in 2012 for my T5400) - no change
b) I’ve tried turning off secure memory as well as setting gp_delay=1 in grub via GRUB_CMDLINE_LINUX_DEFAULT="quiet rcutree.rcu_idle_gp_delay=1 mem_encrypt=off" - no change
c) Removing the AMD card made no difference
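
For anyone repeating step b), here is a rough sketch of setting that line with a script instead of by hand. `set_grub_cmdline` is a made-up helper; it assumes the file already has a `GRUB_CMDLINE_LINUX_DEFAULT=` line and that the new value contains no `|`, `&` or backslash characters.

```shell
# Hypothetical helper: replace the GRUB_CMDLINE_LINUX_DEFAULT line in a
# grub defaults file, keeping a ".bak" copy of the original next to it.
# $1: file to edit (on Ubuntu this would be /etc/default/grub)
# $2: new kernel command line, without the surrounding quotes
set_grub_cmdline() {
    file="$1"
    value="$2"
    sed -i.bak "s|^GRUB_CMDLINE_LINUX_DEFAULT=.*|GRUB_CMDLINE_LINUX_DEFAULT=\"$value\"|" "$file"
}
```

After editing /etc/default/grub (as root), run `sudo update-grub` and reboot; `cat /proc/cmdline` shows whether the options were actually picked up.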

At this stage, given I’ve run out of options, I’ve ordered a new card.

Questions:

a) Is the NVIDIA official guide the absolute correct guide to follow for Ubuntu 18.04 and a 1050 Ti? It installs 430 while I’ve noticed several other blogs/sites recommend 396.

b) If it is indeed a BIOS issue with my T5400 (even though it is supposed to be compatible with this GPU card), how would I know?

A BIOS incompatibility is improbable.
396 is outdated and doesn’t support the latest CUDA.
The NVIDIA guide is basically for Teslas. Don’t use it when you’re running graphics.
Do:

  • Don’t use the .run installers, use --uninstall to uninstall them
  • purge anything nvidia/cuda
  • add the ubuntu graphics ppa https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa
  • install the driver from that (sudo apt install nvidia-driver-430)
  • download the cuda .deb
  • add the repo to your system (first three steps from install instructions on download page)
  • don’t install cuda
  • instead, run sudo apt install cuda-toolkit-10-1
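
The steps above could be sketched as the command list below. It only prints the commands so they can be reviewed first. Assumptions that are not from the list itself: the purge patterns (one common choice; check what apt proposes to remove before confirming), the `apt-key` line (copy the exact path from the download page; `<version>` is a placeholder), and the helper name `print_install_plan`.

```shell
# Hypothetical helper: print (do not run) the commands for the steps above.
# If a .run installer was used earlier, uninstall it first with its
# --uninstall option, as the first step says.
# $1: CUDA repo .deb file name (defaults to the one used in this thread)
print_install_plan() {
    deb="${1:-cuda-repo-ubuntu1804-10-1-local-10.1.168-418.67_1.0-1_amd64.deb}"
    cat <<EOF
sudo apt-get purge '^nvidia-.*' '^libnvidia-.*' '^cuda-.*'
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt install nvidia-driver-430
sudo dpkg -i $deb
sudo apt-key add /var/cuda-repo-<version>/7fa2af80.pub
sudo apt-get update
sudo apt install cuda-toolkit-10-1
EOF
}
```

The point of the last line is that `cuda-toolkit-10-1` is installed instead of the `cuda` metapackage, so the driver from the PPA stays in place.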

Thank you, that looks much easier than all the shenanigans of the guide. I’ll try later today when I get the replacement GPU and report.

@generix, is the URL https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa correct? I get:

E: The repository 'https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa bionic Release' does not have a Release file.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.

I previously added ppa:graphics-drivers/ppa which I have gone back to. It seems to be the new one that NVIDIA folks recommend.

Oh dear. That’s the website of the ppa with info on how to add it, just click on it.

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update

which you seem to have already done.

Ah, how silly of me. Yes, I already have the correct ppa added. I’m waiting for my new GPU to come in today and then I’ll install the drivers. Will update.

Hurray! It was a faulty GPU!

nvidia-smi
Wed Jun 19 19:36:07 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:02:00.0 Off |                  N/A |
| 34%   43C    P0    N/A /  75W |      0MiB /  4039MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
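
The table truncates the card name. For a one-line summary, `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader` prints the same fields as comma-separated text. A small sketch for turning that into labelled output; `summarize_gpu_csv` is a made-up helper name:

```shell
# Hypothetical helper: read "name, driver_version, memory.total" CSV lines
# (as printed with --format=csv,noheader) on stdin and label each field.
summarize_gpu_csv() {
    while IFS=, read -r name driver mem; do
        # Fields after the first carry a leading space in the CSV output
        printf 'name=%s driver=%s memory=%s\n' "$name" "${driver# }" "${mem# }"
    done
}
```

Usage would be `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader | summarize_gpu_csv`.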

Thank you, @generix for your succinct inputs.