rs277,
Sorry, Linux noob here (I had to search for how to view the syslog). I filtered all the relevant “nvidia” lines from the syslog; I can post more if you think I missed something:
journalctl|grep "nvidia"
Feb 03 08:44:53 bze-server kernel: nvidia: loading out-of-tree module taints kernel.
Feb 03 08:44:53 bze-server kernel: nvidia: module license 'NVIDIA' taints kernel.
Feb 03 08:44:53 bze-server kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Feb 03 08:44:53 bze-server kernel: nvidia: module license taints kernel.
Feb 03 08:44:53 bze-server kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Feb 03 08:44:53 bze-server kernel: audit: type=1400 audit(1706967893.231:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=772 comm="apparmor_parser"
Feb 03 08:44:53 bze-server kernel: audit: type=1400 audit(1706967893.231:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=772 comm="apparmor_parser"
Feb 03 08:44:53 bze-server audit[772]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=772 comm="apparmor_parser"
Feb 03 08:44:53 bze-server audit[772]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=772 comm="apparmor_parser"
Feb 03 08:44:53 bze-server kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.154.05 Thu Dec 28 15:51:29 UTC 2023
Feb 03 08:44:53 bze-server kernel: [drm] [nvidia-drm] [GPU ID 0x00008300] Loading driver
Feb 03 08:44:53 bze-server kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008300] Failed to allocate NvKmsKapiDevice
Feb 03 08:44:53 bze-server kernel: [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008300] Failed to register device
Feb 03 08:44:53 bze-server kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
Feb 03 08:44:53 bze-server kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
Feb 03 08:44:53 bze-server nvidia-persistenced[1034]: Verbose syslog connection opened
Feb 03 08:44:53 bze-server nvidia-persistenced[1034]: Now running with user ID 129 and group ID 137
Feb 03 08:44:53 bze-server nvidia-persistenced[1034]: Started (1034)
Feb 03 08:44:53 bze-server nvidia-persistenced[1034]: device 0000:83:00.0 - registered
Feb 03 08:44:53 bze-server nvidia-persistenced[1034]: Local RPC services initialized
Feb 03 08:45:09 bze-server nvidia-settings-autostart.desktop[2766]: ERROR: A supplied argument is invalid
Feb 03 08:53:28 bze-server sudo[6671]: brett : TTY=pts/0 ; PWD=/home/brett ; USER=root ; COMMAND=/usr/bin/nvidia-smi
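To narrow output like this down to just the error lines, a small filter along these lines should work. This is only a sketch: `nvidia_errors` is a name I made up (it is not part of any NVIDIA tool), and it assumes the lines of interest contain the literal string ERROR, as they do in the log above.

```shell
# nvidia_errors is a hypothetical helper, not part of the NVIDIA driver or tools.
# It reads log lines on stdin and keeps only those that mention nvidia
# (case-insensitive) and also contain the literal string ERROR.
nvidia_errors() {
    grep -i 'nvidia' | grep -F 'ERROR'
}

# Usage (assumes a systemd journal; -b limits output to the current boot):
#   journalctl -b --no-pager | nvidia_errors
```

Taint notices and the nvidia-persistenced startup messages are dropped because they do not contain ERROR, so this could miss failures that are logged with different wording.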
I think the important ERROR lines are:
Feb 03 08:44:53 bze-server kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008300] Failed to allocate NvKmsKapiDevice
Feb 03 08:44:53 bze-server kernel: [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008300] Failed to register device
and
Feb 03 08:45:09 bze-server nvidia-settings-autostart.desktop[2766]: ERROR: A supplied argument is invalid
I'm not sure how to fix the first two, but the desktop entry referenced in the third error is:
cat /etc/xdg/autostart/nvidia-settings-autostart.desktop
[Desktop Entry]
Type=Application
Encoding=UTF-8
Name=NVIDIA X Server Settings
Comment=Configure NVIDIA X Server Settings
Exec=sh -c '/usr/bin/nvidia-settings --load-config-only'
Terminal=false
Icon=nvidia-settings
Categories=System;Settings;
Just running sh -c '/usr/bin/nvidia-settings' pops open a nice-looking NVIDIA GUI, but I get the following errors in the terminal:
sh -c '/usr/bin/nvidia-settings'
ERROR: A supplied argument is invalid
(nvidia-settings:11033): GLib-GObject-CRITICAL **: 09:18:04.716: g_object_unref: assertion 'G_IS_OBJECT (object)' failed
** (nvidia-settings:11033): CRITICAL **: 09:18:04.718: ctk_powermode_new: assertion '(ctrl_target != NULL) && (ctrl_target->h != NULL)' failed
ERROR: nvidia-settings could not find the registry key file or the X server is not
accessible. This file should have been installed along with this driver at
/usr/share/nvidia/nvidia-application-profiles-key-documentation. The
application profiles will continue to work, but values cannot be
prepopulated or validated, and will not be listed in the help text. Please
see the README for possible values and descriptions.
** Message: 09:18:04.767: PRIME: No offloading required. Abort
** Message: 09:18:04.767: PRIME: is it supported? no
So I am stuck again. I really appreciate your help, as it seems this has been asked several times before. I assumed that since someone was able to get the P40 running in an R720, I would be able to get it working in an enterprise server one generation newer.
I bought both the server and the P40 used… separately. Since Tesla cards are harder to debug (one cannot simply plug in a display to see whether the card is working), is there any possibility that the card is being detected but the driver fails to load because of physical damage?