Hi,
I cannot make this one work:
I have a Dell R730 running Ubuntu 22.04.3.
I used Riser 3 and added a P100.
I enabled the BIOS GPU Legacy setting, then disabled it again.
I downloaded the driver from the NVIDIA CUDA developer website and ran
sudo sh cuda_12.2.1_535.86.10_linux.run --no-opengl-libs
to avoid installing the OpenGL libraries.
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
In the end:
lsmod | grep nvidia
nvidia_drm 77824 0
nvidia_modeset 1302528 1 nvidia_drm
nvidia 56532992 1 nvidia_modeset
nvidiafb 61440 0
vgastate 24576 1 nvidiafb
fb_ddc 16384 1 nvidiafb
i2c_algo_bit 16384 2 nvidiafb,mgag200
drm_kms_helper 311296 4 mgag200,nvidia_drm
drm 622592 5 drm_kms_helper,nvidia,mgag200,nvidia_drm
lspci | grep NVIDIA
03:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
nvidia-smi
No devices were found
I have tried different ways to install the driver, but whenever I run nvidia-smi I always get "No devices were found".
Could you please help?
sudo dmesg | grep nvidia
[ 11.925467] nvidiafb: Device ID: 10de15f8
[ 11.925473] nvidiafb: unknown NV_ARCH
[ 12.015811] nvidia: loading out-of-tree module taints kernel.
[ 12.015827] nvidia: module license 'NVIDIA' taints kernel.
[ 12.074954] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 12.097370] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[ 12.236719] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.86.10 Wed Jul 26 23:01:50 UTC 2023
[ 12.242512] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[ 12.242514] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[ 12.409378] audit: type=1400 audit(1691342857.950:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1573 comm="apparmor_parser"
[ 12.409403] audit: type=1400 audit(1691342857.950:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1573 comm="apparmor_parser"
[ 531.056647] [drm] [nvidia-drm] [GPU ID 0x00000300] Unloading driver
[ 531.106298] nvidia-modeset: Unloading
[ 531.139051] nvidia-nvlink: Unregistered Nvlink Core, major device number 508
[ 546.540595] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[ 546.693604] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 546.699398] nvidia-uvm: Loaded the UVM driver, major device number 506.
[ 546.705225] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.86.10 Wed Jul 26 23:01:50 UTC 2023
[ 546.708939] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[ 546.708942] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[ 546.714750] [drm] [nvidia-drm] [GPU ID 0x00000300] Unloading driver
[ 546.752891] nvidia-modeset: Unloading
[ 546.780644] nvidia-uvm: Unloaded the UVM driver.
[ 546.809003] nvidia-nvlink: Unregistered Nvlink Core, major device number 508
[ 562.200098] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[ 562.322454] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 535.86.10 Wed Jul 26 23:01:50 UTC 2023
[ 562.325817] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[ 562.325819] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[ 747.837871] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[ 747.844925] nvidia-uvm: Loaded the UVM driver, major device number 506.
- nvidia: module verification failed - indicates a driver-signing issue
- nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
- nvidia-uvm: Loaded the UVM driver
So the core nvidia driver and UVM/modeset modules are loading.
The signature verification failed warning can likely be ignored for now.
Why does nvidia-smi still not detect the GPU if the modules are clearly loading? Does anyone have any clue?
I already enabled "Memory Mapped I/O above 4 GB" in the BIOS settings.
I disabled "Embedded Video Controller" in the BIOS settings.
The card is plugged into slot 6 on Riser 3 with an 8-pin cable (For DELL R730 8pin to 8pin Power Cable Nvidia K80/M40/M60/P40/P100 PCIE GPU @USA | eBay).
For people who face the same problems later:
FAN SPEED
Enable manual fan control:
sudo ipmitool raw 0x30 0x30 0x01 0x00
For example, to set the fan speed to 20%:
sudo ipmitool raw 0x30 0x30 0x02 0xff 0x14
Monitor fan speed:
sudo ipmitool sdr type fan
Monitor temperature:
sudo ipmitool sdr type temperature
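If you want to script the speed setting shown above, here is a minimal sketch; the only addition is the percentage-to-hex conversion, the raw ipmitool commands are the same as those listed above:
# minimal sketch: set the fans to a given percentage after enabling manual control
PCT=20                                    # desired fan speed in percent
HEX=$(printf '0x%02x' "$PCT")             # 20 -> 0x14
sudo ipmitool raw 0x30 0x30 0x01 0x00     # enable manual fan control
sudo ipmitool raw 0x30 0x30 0x02 0xff "$HEX"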
I solved the problem:
It was basically a cable problem.
I ended up buying a lot of cables.
Unfortunately, a lot of cables are sold as if they would work for the R730 with a Tesla P100 16 GB.
They do not; the sellers misrepresent the product. Believe me, I tried!
Beware!
I ended up contacting the following seller on eBay:
professionalrecycling402
This seller has a YouTube channel dedicated to computers and his cats: Computers Cats and More.
There he identifies the power pin configuration of the R-series Dell server risers.
He prepared and sold me an 8-pin cable that really works for R-series Dell servers!
Forget the cables on Amazon or other shops on eBay.
Get this one:
https://www.ebay.com/itm/225618308661
Second, after installing the cable, I got the following error:
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Here is how I solved it:
# remove everything NVIDIA-related
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'
sudo apt-get install linux-headers-$(uname -r)
Reboot the server, then:
sudo add-apt-repository ppa:graphics-drivers/ppa --yes
sudo apt update
sudo apt install nvidia-driver-470
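Optionally, after another reboot you can confirm that the module was actually built for the running kernel; this is a quick check assuming the packaged driver is built through DKMS:
dkms status | grep nvidia       # the nvidia module should show as "installed" for the running kernel
lsmod | grep nvidia             # the nvidia kernel module should be loaded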
Done:
nvidia-smi
Fri Aug 18 22:11:35 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02 Driver Version: 470.199.02 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:03:00.0 Off | 0 |
| N/A 37C P0 28W / 250W | 0MiB / 16280MiB | 1% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I’ll be happy to answer your questions!
Cheers!
Hello @cemdede and welcome to the NVIDIA developer forums!
Thank you for this detailed problem/solution post!
I was out for a while so I missed it earlier.
A good indicator of power-supply issues to the GPU is also a dmesg warning with text like "GPU has fallen off the bus". As a general guideline in these situations, it is very helpful to attach the output file of nvidia-bug-report.sh, which includes a lot of useful data on the GPU and how the system tries to load it.
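For example (a quick check; the grep string is just the warning text mentioned above):
sudo dmesg | grep -i "fallen off the bus"   # power or seating problems often show up here
sudo nvidia-bug-report.sh                   # writes nvidia-bug-report.log.gz in the current directory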
The ppa repository is a good choice for custom driver installations. Please remember, though, to disable automatic (driver) updates from Ubuntu; otherwise it will not only replace the current driver but also miss some of the ppa-based installation files and break your driver completely. Any future updates should be done through the ppa as well, and manually.
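One way to do that (a sketch, assuming the nvidia-driver-470 package used in the post above) is to put the driver package on hold so unattended upgrades leave it alone:
sudo apt-mark hold nvidia-driver-470        # prevent automatic replacement
# later, to update manually from the ppa:
sudo apt-mark unhold nvidia-driver-470
sudo apt update && sudo apt install nvidia-driver-470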
Great to hear you are up and running now!