Need Help with P100 installation (R730 Dell)

cemdede · August 6, 2023, 6:37pm

Hi,

I cannot get this setup to work.
I have a Dell R730 running Ubuntu 22.04.3.
I installed a P100 in Riser 3.
I also enabled the GPU Legacy setting in the BIOS.

I downloaded the driver from the NVIDIA CUDA developer website and ran

sudo sh cuda_12.2.1_535.86.10_linux.run --no-opengl-libs

to avoid installing the OpenGL libraries.

 nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

In the end:

lsmod | grep nvidia

nvidia_drm             77824  0
nvidia_modeset       1302528  1 nvidia_drm
nvidia              56532992  1 nvidia_modeset
nvidiafb               61440  0
vgastate               24576  1 nvidiafb
fb_ddc                 16384  1 nvidiafb
i2c_algo_bit           16384  2 nvidiafb,mgag200
drm_kms_helper        311296  4 mgag200,nvidia_drm
drm                   622592  5 drm_kms_helper,nvidia,mgag200,nvidia_drm

lspci | grep NVIDIA

03:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)

nvidia-smi

No devices were found

I tried several ways of installing the driver, but nvidia-smi always reports "No devices were found".
Could you please help?

cemdede · August 6, 2023, 6:52pm

sudo dmesg | grep nvidia

[   11.925467] nvidiafb: Device ID: 10de15f8 
[   11.925473] nvidiafb: unknown NV_ARCH
[   12.015811] nvidia: loading out-of-tree module taints kernel.
[   12.015827] nvidia: module license 'NVIDIA' taints kernel.
[   12.074954] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   12.097370] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[   12.236719] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.86.10  Wed Jul 26 23:01:50 UTC 2023
[   12.242512] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[   12.242514] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[   12.409378] audit: type=1400 audit(1691342857.950:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1573 comm="apparmor_parser"
[   12.409403] audit: type=1400 audit(1691342857.950:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1573 comm="apparmor_parser"
[  531.056647] [drm] [nvidia-drm] [GPU ID 0x00000300] Unloading driver
[  531.106298] nvidia-modeset: Unloading
[  531.139051] nvidia-nvlink: Unregistered Nvlink Core, major device number 508
[  546.540595] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[  546.693604] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[  546.699398] nvidia-uvm: Loaded the UVM driver, major device number 506.
[  546.705225] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.86.10  Wed Jul 26 23:01:50 UTC 2023
[  546.708939] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[  546.708942] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[  546.714750] [drm] [nvidia-drm] [GPU ID 0x00000300] Unloading driver
[  546.752891] nvidia-modeset: Unloading
[  546.780644] nvidia-uvm: Unloaded the UVM driver.
[  546.809003] nvidia-nvlink: Unregistered Nvlink Core, major device number 508
[  562.200098] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[  562.322454] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.86.10  Wed Jul 26 23:01:50 UTC 2023
[  562.325817] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[  562.325819] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[  747.837871] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[  747.844925] nvidia-uvm: Loaded the UVM driver, major device number 506.

nvidia: module verification failed - indicates signed driver issue
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver
nvidia-uvm: Loaded the UVM driver

So the core nvidia driver and UVM/modeset modules are loading.
The signature verification failed warning can likely be ignored for now.

Why does nvidia-smi still not detect the GPU if the modules are clearly loading? Does anyone have a clue?

rs277 · August 7, 2023, 12:43am

I have no experience with the R730, but a couple of things to check looking at the manual - hopefully not too obvious.

In the BIOS, under Integrated Devices, ensure “Memory Mapped I/O above 4 GB” is enabled.

The P100 needs an 8-pin CPU connector plugged into the rear. The manual states that when using GPUs on a riser, the 8-pin connector should be connected to the riser. It’s not immediately obvious how that power gets to the P100. Does the riser have an 8-pin cable that then connects to the rear of the P100?

If the P100 does not have an 8-pin power connection, it won’t function correctly.

cemdede · August 7, 2023, 12:50am

Thank you for your answers.
I already enabled “Memory Mapped I/O above 4 GB” in the BIOS settings.
I disabled “Embedded Video Controller” in the BIOS settings.
The card is plugged into slot 6 on Riser 3 with 8 pin cable (For DELL R730 8pin to 8pin Power Cable Nvidia K80/M40/M60/P40/P100 PCIE GPU @USA | eBay).

cemdede · August 7, 2023, 12:53am

For people who will later face the same problems:
FAN SPEED

Enable manual fan control:
sudo ipmitool raw 0x30 0x30 0x01 0x00

For example, to set the fan speed to 20%:
sudo ipmitool raw 0x30 0x30 0x02 0xff 0x14

Monitor fan speed:
sudo ipmitool sdr type fan

Monitor temperature:
sudo ipmitool sdr type temperature
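
The last byte of the fan speed command above is the duty cycle in hexadecimal (0x14 is 20 in decimal). A minimal sketch for building that byte from a percentage follows; fan_pct_to_hex is a hypothetical helper name, not part of ipmitool, and the raw command it is combined with is simply the one quoted in this post.

```shell
# Convert a fan duty cycle (0-100 %) into the hex byte used as the last
# argument of the raw IPMI command quoted above.
# fan_pct_to_hex is a hypothetical helper name, not an ipmitool feature.
fan_pct_to_hex() {
    pct="$1"
    # Reject empty input, non-digits, and anything longer than 3 digits.
    case "$pct" in
        ''|*[!0-9]*|????*)
            echo "fan speed must be an integer from 0 to 100" >&2
            return 1 ;;
    esac
    # Strip leading zeros so printf does not read the value as octal.
    pct="$(printf '%s' "$pct" | sed 's/^0*//')"
    [ -n "$pct" ] || pct=0
    if [ "$pct" -gt 100 ]; then
        echo "fan speed must be an integer from 0 to 100" >&2
        return 1
    fi
    printf '0x%02x\n' "$pct"
}

# Print (do not run) the command for a 30 % duty cycle.
echo "sudo ipmitool raw 0x30 0x30 0x02 0xff $(fan_pct_to_hex 30)"
```

Printing the command first makes it easy to double-check the value before sending it to the BMC.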

cemdede · August 14, 2023, 3:20pm

The computer is a Dell R730 running Ubuntu 22.04.3 LTS Server. I tried both an 8-pin and a 16-pin Riser 3 for this 8-pin Tesla P100 16GB. I even added 2x 1100W power supplies.

Here is the power-related info from NVIDIA about Tesla P100: Link

The riser slot itself supplies no more than 75W, I guess, and the 8-pin riser power outlet provides 12 V.

The problem is that Dell does not want to support solutions on its forum page, even though people are using these cards.

The cable that was mentioned on the forum sites is a 2x male 8-pin cable, but since it is a cheap knockoff, it does not work as intended, although the descriptions clearly state that it does.

As for the driver, after trying many versions from NVIDIA, this is the best I could get.
This one seems to give more detailed error messages:

NVIDIA-Linux-x86_64-460.106.00.run

Then, here are a couple of things you might want to know:

Yes, the running Ubuntu kernel and the kernel the driver module was built for match:

uname -r

5.15.0-78-generic

modinfo nvidia | grep vermagic

vermagic: 5.15.0-78-generic SMP mod_unload modversions
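
The comparison above can be scripted. This is a minimal sketch, assuming modinfo prints a line that starts with "vermagic:" followed by the kernel release, as in the output shown here; vermagic_kernel is a hypothetical helper name.

```shell
# Extract the kernel release from a modinfo "vermagic:" line.
# vermagic_kernel is a hypothetical helper, not a standard tool.
vermagic_kernel() {
    # Input: one line such as
    # "vermagic: 5.15.0-78-generic SMP mod_unload modversions"
    printf '%s\n' "$1" | awk '$1 == "vermagic:" { print $2 }'
}

running="$(uname -r)"
built_for="$(vermagic_kernel "$(modinfo nvidia 2>/dev/null | grep vermagic)")"

if [ -n "$built_for" ] && [ "$running" = "$built_for" ]; then
    echo "nvidia module matches running kernel $running"
else
    echo "mismatch or module not found: running=$running module=${built_for:-none}"
fi
```

A match only shows that the module was built for the running kernel; it says nothing about whether the card itself is powered correctly.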

The P100 does not seem to run because of a power issue, but this is tricky, since I previously saw on the forums that incompatible drivers can cause the same symptoms.

nvidia-smi

Unable to determine the device handle for GPU 0000:03:00.0: Unable to communicate with GPU because it is insufficiently powered.
This may be because not all required external power cables are
attached, or the attached cables are not seated properly.

sudo lshw -C display

  *-display
       description: 3D controller
       product: GP100GL [Tesla P100 PCIe 16GB]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:03:00.0
       logical name: /dev/fb0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress bus_master cap_list fb
       configuration: depth=32 driver=nvidia latency=0 mode=1440x900 visual=truecolor xres=1440 yres=900
       resources: iomemory:3b80-3b7f iomemory:3bc0-3bbf irq:196 memory:91000000-91ffffff memory:3b800000000-3bbffffffff memory:3bc00000000-3bc01ffffff
  *-display
       description: VGA compatible controller
       product: G200eR2
       vendor: Matrox Electronics Systems Ltd.
       physical id: 0
       bus info: pci@0000:09:00.0
       logical name: /dev/fb0
       version: 01
       width: 32 bits
       clock: 33MHz
       capabilities: pm vga_controller bus_master cap_list rom fb
       configuration: depth=32 driver=mgag200 latency=64 maxlatency=32 mingnt=16 resolution=1440,900
       resources: irq:19 memory:90000000-90ffffff memory:92800000-92803fff memory:92000000-927fffff memory:c0000-dffff

lsmod | grep nvidia

nvidia_drm             65536  0
nvidia_modeset       1228800  1 nvidia_drm
nvidia              34197504  1 nvidia_modeset
drm_kms_helper        311296  4 mgag200,nvidia_drm
drm                   622592  4 drm_kms_helper,mgag200,nvidia_drm

sudo dmesg | grep nvidia

[ 11.501997] nvidia: loading out-of-tree module taints kernel.
[ 11.502016] nvidia: module license 'NVIDIA' taints kernel.
[ 11.525096] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 11.611845] nvidia-nvlink: Nvlink Core is being initialized, major device number 508
[ 11.753382] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 460.106.00 Tue Sep 28 11:57:18 UTC 2021
[ 11.775569] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[ 11.775572] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 1
[ 12.033643] audit: type=1400 audit(1692022978.423:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1607 comm="apparmor_parser"
[ 12.033648] audit: type=1400 audit(1692022978.423:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1607 comm="apparmor_parser"

I guess this says that the kernel was unable to verify the digital signature of the NVIDIA module. I don’t know whether this is because the card is not being powered properly or because I did not install the driver correctly.

I already enabled “Memory Mapped I/O above 4 GB” in the BIOS settings.
I disabled “Embedded Video Controller” in the BIOS settings.

To rule out a Nouveau driver conflict, I did:

sudo nano /etc/modprobe.d/blacklist-nouveau.conf

and added:

blacklist nouveau
options nouveau modeset=0
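
The same file can be written without an editor. This is a minimal sketch, assuming an Ubuntu system; write_nouveau_blacklist is a hypothetical helper name, and the update-initramfs and reboot steps in the comments are the usual follow-up on Ubuntu rather than something stated in this post.

```shell
# Write the two blacklist lines to the file given as the first argument.
# write_nouveau_blacklist is a hypothetical helper name; taking the path
# as a parameter lets it be tried on a scratch file first.
write_nouveau_blacklist() {
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$1"
}

# Dry run against a temporary file to inspect the result.
tmp="$(mktemp)"
write_nouveau_blacklist "$tmp"
cat "$tmp"
rm -f "$tmp"

# On the real system (assumption: Ubuntu, run as root) the target would be
# /etc/modprobe.d/blacklist-nouveau.conf, followed by:
#   sudo update-initramfs -u
#   sudo reboot
```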

During my previous driver installation attempts, I successfully installed CUDA multiple times.

But this time, I haven’t installed it, so I cannot give you the output of nvcc -V

Every time I wanted to re-install the driver, I went back and purged everything related to NVIDIA:

sudo apt-get purge 'nvidia-*'

I even tried automatic driver installation, but to no avail.

So, I can wholeheartedly say I tried many ways. But I was not able to make it work, so I’m all open to any ideas.

Thank you very much for your time and help in advance!!!

rs277 · August 14, 2023, 7:36pm

The error message, “Unable to determine the device handle for GPU 0000:03:00.0: Unable to communicate with GPU because it is insufficiently powered.”, can occur when not all 12V pins on the GPU socket are connected.

The wiring diagram, Figure 4 in the P100 PDF you linked, clearly shows that all four pins, 5 to 8, need 12V.

The cable in the eBay link only has 3 pins connected (shown in photo 2) and has an extra GND wire fitted. It’s quite possible that this GND is shorting the PSU and causing it to shut down the +12V supply to the card. You may want to measure voltages from the back of the GPU connector with everything connected, to confirm.

I offer this advice as is and accept no responsibility for possible outcomes.

cemdede · August 14, 2023, 8:52pm

Thank you very much for your reply.

cemdede · August 18, 2023, 11:34pm

I solved the problem:

It was basically a cable problem.
I ended up buying a lot of cables.
Unfortunately, many sellers advertise cables as if they would work for the R730 with a Tesla P100 16 GB.
They do NOT! The listings are wrong about the product. Believe me, I tried!
Beware!

I ended up contacting the following seller on eBay:
professionalrecycling402

This seller has a YouTube channel dedicated to computers and his cats: Computers Cats and More

There he identifies the riser pin power configuration of the R series Dell servers.

He prepared and sold me an 8-pin cable that really works for R series Dell servers!

Forget the cables on Amazon or other shops on eBay.
Get this one:

https://www.ebay.com/itm/225618308661

Secondly, after the installation of the cable, I had the following error:

nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Here is how I solved it:
#get rid of everything nvidia

sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'

sudo apt-get install linux-headers-$(uname -r)

reboot your server

sudo add-apt-repository ppa:graphics-drivers/ppa --yes
sudo apt update
sudo apt install nvidia-driver-470

Done:

nvidia-smi

Fri Aug 18 22:11:35 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   37C    P0    28W / 250W |      0MiB / 16280MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
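
Three different nvidia-smi messages came up in this thread, and each pointed to a different next step. The sketch below sorts a message into one of those buckets; classify_smi_message and the category names are hypothetical, and the mapping only reflects what happened on this one machine, not a general diagnosis.

```shell
# Map an nvidia-smi message to the area that needed attention in this thread.
# classify_smi_message and its category names are hypothetical.
classify_smi_message() {
    case "$1" in
        *"insufficiently powered"*)
            echo "power" ;;      # here: the 8-pin cable pinout was wrong
        *"couldn't communicate with the NVIDIA driver"*)
            echo "driver" ;;     # here: fixed by purging and reinstalling the driver
        *"No devices were found"*)
            echo "no-device" ;;  # here: seen before the power problem was identified
        *"Driver Version:"*)
            echo "ok" ;;         # the normal status table was printed
        *)
            echo "unknown" ;;
    esac
}

# Classify whatever nvidia-smi prints on this machine (stdout and stderr).
msg="$(nvidia-smi 2>&1 || true)"
echo "category: $(classify_smi_message "$msg")"
```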

I’ll be happy to answer your questions!

Cheers!
