Need Help with P100 installation (R730 Dell)

Hi,

I cannot get this to work:
I have a Dell R730 running Ubuntu 22.04.3.
I added a P100 on Riser 3.
I enabled the legacy GPU setting in the BIOS, then disabled it again.

I downloaded the driver from the NVIDIA CUDA developer website and ran

sudo sh cuda_12.2.1_535.86.10_linux.run --no-opengl-libs

in order to prevent it from installing the OpenGL libraries.

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
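A side note on the nvcc output above: release 11.5 looks like it comes from Ubuntu's nvidia-cuda-toolkit package rather than the CUDA 12.2 runfile, so the runfile toolkit is probably just not first on the PATH. A quick way to check, assuming the runfile used its default /usr/local/cuda location:

which nvcc

ls -l /usr/local/cuda/bin/nvcc

# if the runfile toolkit is there, put it first on the PATH
export PATH=/usr/local/cuda/bin:$PATH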

In the end:

lsmod | grep nvidia

nvidia_drm             77824  0
nvidia_modeset       1302528  1 nvidia_drm
nvidia              56532992  1 nvidia_modeset
nvidiafb               61440  0
vgastate               24576  1 nvidiafb
fb_ddc                 16384  1 nvidiafb
i2c_algo_bit           16384  2 nvidiafb,mgag200
drm_kms_helper        311296  4 mgag200,nvidia_drm
drm                   622592  5 drm_kms_helper,nvidia,mgag200,nvidia_drm

lspci | grep NVIDIA

03:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)

nvidia-smi

No devices were found

I tried different ways to install the driver, but whenever I run nvidia-smi I always get "No devices were found".
Could you please help?

sudo dmesg | grep nvidia

[   11.925467] nvidiafb: Device ID: 10de15f8 
[   11.925473] nvidiafb: unknown NV_ARCH
[   12.015811] nvidia: loading out-of-tree module taints kernel.
[   12.015827] nvidia: module license 'NVIDIA' taints kernel.
[   12.074954] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   12.097370] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[   12.236719] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.86.10  Wed Jul 26 23:01:50 UTC 2023
[   12.242512] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[   12.242514] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[   12.409378] audit: type=1400 audit(1691342857.950:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1573 comm="apparmor_parser"
[   12.409403] audit: type=1400 audit(1691342857.950:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1573 comm="apparmor_parser"
[  531.056647] [drm] [nvidia-drm] [GPU ID 0x00000300] Unloading driver
[  531.106298] nvidia-modeset: Unloading
[  531.139051] nvidia-nvlink: Unregistered Nvlink Core, major device number 508
[  546.540595] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[  546.693604] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[  546.699398] nvidia-uvm: Loaded the UVM driver, major device number 506.
[  546.705225] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.86.10  Wed Jul 26 23:01:50 UTC 2023
[  546.708939] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[  546.708942] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[  546.714750] [drm] [nvidia-drm] [GPU ID 0x00000300] Unloading driver
[  546.752891] nvidia-modeset: Unloading
[  546.780644] nvidia-uvm: Unloaded the UVM driver.
[  546.809003] nvidia-nvlink: Unregistered Nvlink Core, major device number 508
[  562.200098] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[  562.322454] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.86.10  Wed Jul 26 23:01:50 UTC 2023
[  562.325817] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[  562.325819] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[  747.837871] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[  747.844925] nvidia-uvm: Loaded the UVM driver, major device number 506.

  • nvidia: module verification failed - indicates a driver signing issue
  • nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
  • nvidia-uvm: Loaded the UVM driver

So the core nvidia driver and UVM/modeset modules are loading.
The signature verification failed warning can likely be ignored for now.

Why does nvidia-smi still not detect the GPU if the modules are clearly loading? Does anyone have any clue?

I already enabled "Memory Mapped I/O above 4 GB" in the BIOS settings.
I disabled "Embedded Video Controller" in the BIOS settings.
The card is plugged into slot 6 on Riser 3 with an 8-pin cable (For DELL R730 8pin to 8pin Power Cable Nvidia K80/M40/M60/P40/P100 PCIE GPU @USA | eBay).
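For anyone hitting the same symptom: a quick check worth doing at this stage is to look in the kernel log for signs of the GPU losing power or dropping off the bus, for example (assuming the driver logs such an event around the time nvidia-smi fails):

sudo dmesg | grep -iE "fallen off the bus|xid"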

For people who later face the same problems:
FAN SPEED

Enable manual fan control:
sudo ipmitool raw 0x30 0x30 0x01 0x00

For example, to set the fan speed to 20% (the last byte is the duty cycle in hex, 0x14 = 20):
sudo ipmitool raw 0x30 0x30 0x02 0xff 0x14

Monitor fan speed:
sudo ipmitool sdr type fan

Monitor temperature:
sudo ipmitool sdr type temperature
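To tie those together, here is a minimal sketch of a wrapper script (my own helper, nothing official) that switches the iDRAC to manual mode and sets all fans to a duty cycle given in percent:

#!/usr/bin/env bash
# set-fans.sh <percent>  - force all fans to a fixed duty cycle via raw IPMI
set -euo pipefail
pct="${1:?usage: set-fans.sh <percent 0-100>}"

# enable manual fan control
sudo ipmitool raw 0x30 0x30 0x01 0x00

# 0xff addresses all fans; the duty cycle is passed as a hex byte
sudo ipmitool raw 0x30 0x30 0x02 0xff "$(printf '0x%02x' "$pct")"

# show the resulting fan readings
sudo ipmitool sdr type fan

# to hand control back to the iDRAC, this reportedly works:
# sudo ipmitool raw 0x30 0x30 0x01 0x01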

I solved the problem:

It was basically a cable problem.
I ended up buying a lot of cables.
Unfortunately, a lot of sellers list cables as if they would work for the R730 + Tesla P100 16 GB.
No! They lie about the product. Believe me, I tried!
Beware!
I ended up contacting the following seller on eBay:
professionalrecycling402

This seller has a YouTube channel dedicated to computers and his cats: Computers Cats and More.

In one of his videos he identifies the power pin configuration of the R-series Dell server risers.

He prepared and sold me an 8-pin cable that really works for R-series Dell servers!

Forget the cables on Amazon or other shops on eBay.
Get this one:

https://www.ebay.com/itm/225618308661

Second, after installing the cable, I got the following error:

nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Here is how I solved it:
# get rid of everything nvidia

sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'

sudo apt-get install linux-headers-$(uname -r)

Reboot your server.

sudo add-apt-repository ppa:graphics-drivers/ppa --yes
sudo apt update
sudo apt install nvidia-driver-470 
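If nvidia-smi still cannot talk to the driver after this, it is worth confirming that DKMS actually built the module against the running kernel (the Ubuntu/PPA driver packages build their kernel module via DKMS, which is why the matching linux-headers step above matters):

dkms status

lsmod | grep nvidia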

Done:

nvidia-smi

Fri Aug 18 22:11:35 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   37C    P0    28W / 250W |      0MiB / 16280MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I’ll be happy to answer your questions!

Cheers!


Hello @cemdede and welcome to the NVIDIA developer forums!

Thank you for this detailed problem/solution post!

I was out for a while so I missed it earlier.

A good indicator of power-supply issues to the GPU is also a dmesg warning with text like "GPU has fallen off the bus". As a general guideline, in these situations it is very helpful to attach the resulting output file of nvidia-bug-report.sh. It includes a lot of helpful data on the GPU and how the system tries to load it.
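For reference, that script ships with the driver and is typically run with sudo; it writes a compressed log into the current directory, which can then be attached to a forum post:

sudo nvidia-bug-report.sh

# produces nvidia-bug-report.log.gz in the current working directory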

The PPA repository is a good choice for custom driver installations. Please remember, though, to disable automatic (driver) updates from Ubuntu; otherwise Ubuntu will not only replace the current driver but also miss some of the PPA-based installation files and mess up your driver completely. Any future updates should be done manually through the PPA as well.
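One simple way to do that, assuming the 470 series installed above, is to put the driver metapackage on hold so unattended upgrades leave it alone:

sudo apt-mark hold nvidia-driver-470

# and before a deliberate, manual update from the PPA:
sudo apt-mark unhold nvidia-driver-470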

Great to hear you are up and running now!
