Hello,
Last year I bought an HPE server for work with an RTX 4000 graphics card for machine learning. When I tried to use it last month, I discovered that the GPU is not detected (using TensorFlow).
I can't figure out how to solve this driver issue.
I followed this tutorial to install the driver for the RTX 4000: NvidiaGraphicsDrivers - Debian Wiki
Here is some of the output I get:
> lspci -nn | egrep -i "3d|display|vga"
01:00.1 VGA compatible controller [0300]: Matrox Electronics Systems Ltd. MGA G200eH3 [102b:0538] (rev 02)
61:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL [Quadro RTX 4000] [10de:1eb1] (rev a1)
Then I installed nvidia-detect and ran it:
> nvidia-detect
Detected NVIDIA GPUs:
61:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL [Quadro RTX 4000] [10de:1eb1] (rev a1)
Checking card: NVIDIA Corporation TU104GL [Quadro RTX 4000] (rev a1)
Your card is supported by the default drivers.
It is recommended to install the
nvidia-driver
package.
[nvidia-bug-report.log|attachment](upload://zWOhimQAmXTYJscXGN65G0QaTbL.log) (368.6 KB)
So I installed nvidia-driver and rebooted the server as described in the tutorial, but I get:
> nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I tried upgrading nvidia-driver and installing it from backports, but I had the same issue.
I also identified this problem:
> sudo systemctl status nvidia-persistenced.service
● nvidia-persistenced.service - NVIDIA Persistence Daemon
Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2020-09-28 11:45:04 CEST; 1h 55min ago
Process: 775 ExecStart=/usr/bin/nvidia-persistenced --user nvpd (code=exited, status=1/FAILURE)
Process: 808 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
sept. 28 11:45:04 SI-UNICORN-269 systemd[1]: Starting NVIDIA Persistence Daemon...
sept. 28 11:45:04 SI-UNICORN-269 nvidia-persistenced[777]: Started (777)
sept. 28 11:45:04 SI-UNICORN-269 nvidia-persistenced[777]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 112 has read and write permissions fo
sept. 28 11:45:04 SI-UNICORN-269 nvidia-persistenced[775]: nvidia-persistenced failed to initialize. Check syslog for more details.
sept. 28 11:45:04 SI-UNICORN-269 nvidia-persistenced[777]: Shutdown (777)
sept. 28 11:45:04 SI-UNICORN-269 systemd[1]: nvidia-persistenced.service: Control process exited, code=exited, status=1/FAILURE
sept. 28 11:45:04 SI-UNICORN-269 systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
sept. 28 11:45:04 SI-UNICORN-269 systemd[1]: Failed to start NVIDIA Persistence Daemon.
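The daemon's message above points at the /dev/nvidia* device files. Here is a minimal sketch of how that could be checked, together with whether a kernel module whose name starts with "nvidia" is currently loaded. The function names are made up for illustration, and the script only reads /dev and /proc/modules:

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of any NVIDIA tool.

# Print the number of /dev/nvidia* device nodes (0 if none exist).
count_nvidia_devices() {
    count=0
    for node in /dev/nvidia*; do
        # An unmatched glob stays literal, so test that the path exists.
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Print "yes" if a loaded kernel module name starts with "nvidia", else "no".
nvidia_module_loaded() {
    if [ -r /proc/modules ] && grep -q '^nvidia' /proc/modules; then
        echo "yes"
    else
        echo "no"
    fi
}

echo "device nodes: $(count_nvidia_devices)"
echo "module loaded: $(nvidia_module_loaded)"
```

If this reports zero device nodes and no loaded module, the failure of nvidia-persistenced would be a symptom of the kernel module not loading rather than a permissions problem.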
I thought I had found a solution here, but I can't disable Secure Boot in my BIOS as indicated. I can see the setting in the security section, but I can't access the button to change it.
My configuration:
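Whether Secure Boot is actually enabled can also be checked from the running system. This is a rough sketch, assuming the mokutil package is installed (it may not be by default); the function name is hypothetical:

```shell
#!/bin/sh
# Sketch only: secure_boot_state is an illustrative helper name.

# Print the Secure Boot state, or a fallback word when it cannot be determined.
secure_boot_state() {
    if [ ! -d /sys/firmware/efi ]; then
        # No EFI directory: the system was not booted in UEFI mode.
        echo "not-uefi"
    elif command -v mokutil >/dev/null 2>&1; then
        # mokutil prints a line such as "SecureBoot enabled" or "SecureBoot disabled".
        mokutil --sb-state 2>/dev/null || echo "unknown"
    else
        echo "unknown"
    fi
}

secure_boot_state
```

If this reports Secure Boot as enabled, an unsigned out-of-tree module would typically be refused by the kernel, which would fit the nvidia-smi error above.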
> uname -a
Linux 5.4.0-1-amd64 #1 SMP Debian 5.4.6-1 (2019-12-27) x86_64 GNU/Linux
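Since the Debian driver package builds its kernel module through DKMS, it may also be worth checking whether headers and a built module exist for the running kernel. A minimal sketch, with hypothetical function names, that only inspects /lib/modules:

```shell
#!/bin/sh
# Sketch only: helper names are illustrative. Each takes an optional kernel
# release and defaults to the running kernel reported by `uname -r`.

# Print "yes" if the build directory (kernel headers) exists for the release.
headers_present() {
    release=${1:-$(uname -r)}
    if [ -e "/lib/modules/$release/build" ]; then
        echo "yes"
    else
        echo "no"
    fi
}

# Print "yes" if any nvidia*.ko* file exists in the release's module tree.
nvidia_module_built() {
    release=${1:-$(uname -r)}
    if find "/lib/modules/$release" -name 'nvidia*.ko*' 2>/dev/null | grep -q .; then
        echo "yes"
    else
        echo "no"
    fi
}

echo "headers present: $(headers_present)"
echo "nvidia module built: $(nvidia_module_built)"
```

A "no" for either line would suggest the module was never built for this kernel, for example because the matching linux-headers package is missing.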
Any idea what the issue is here?
Thanks for your help.