I cannot make this one work:
I have a Dell R730 running Ubuntu 22.04.3.
I used Riser 3 and added a P100.
I enabled the BIOS GPU Legacy settings as well.
I downloaded the driver from the NVIDIA CUDA developer website and ran
sudo sh cuda_12.2.1_535.86.10_linux.run --no-opengl-libs
to avoid installing the OpenGL libraries.
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
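The nvcc output above reports release 11.5 even though the runfile is CUDA 12.2.1, which suggests that a different toolkit comes first on PATH. A minimal sketch for checking this, assuming the runfile used its default prefix under /usr/local (the function name report_nvcc is only for illustration):

```shell
#!/usr/bin/env bash
# Show which nvcc the shell resolves and list any toolkits under /usr/local.
# Assumption: the runfile installed to its default prefix, /usr/local/cuda-<version>.
report_nvcc() {
    local nvcc_path candidate
    nvcc_path="$(command -v nvcc 2>/dev/null || true)"
    if [ -n "$nvcc_path" ]; then
        echo "nvcc on PATH: $nvcc_path"
        "$nvcc_path" --version | tail -n 2
    else
        echo "nvcc on PATH: none"
    fi
    for candidate in /usr/local/cuda*/bin/nvcc; do
        # An unmatched glob stays literal, so test that the file really exists.
        if [ -x "$candidate" ]; then
            echo "toolkit found: $candidate"
        fi
    done
    return 0
}

report_nvcc
```

If the first path printed is /usr/bin/nvcc, it probably comes from a distribution package rather than from the runfile. Putting the runfile's bin directory ahead of it in PATH would be one way to select the newer toolkit.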
I have no experience with the R730, but a couple of things to check looking at the manual - hopefully not too obvious.
In the BIOS, under Integrated Devices, ensure “Memory Mapped I/O above 4 GB” is enabled.
The P100 needs an 8-pin CPU connector plugged into the rear. The manual states that when using GPUs on a riser, the 8-pin connector should be connected to the riser. It’s not immediately obvious how that power gets to the P100. Does the riser have an 8-pin cable that then connects to the rear of the P100?
If the P100 does not have an 8pin power connection, it won’t function correctly.
The computer is a Dell R730 running Ubuntu 22.04.3 LTS Server. I tried both an 8-pin and a 16-pin Riser 3 for this 8-pin Tesla P100 16GB. I even added 2x 1100W power supplies.
Here is the power-related info from NVIDIA about Tesla P100: Link
The riser slot itself supplies no more than 75W, I guess, and the riser's 8-pin power outlet provides 12 V.
The problem is that Dell does not want to support solutions on its forum page, but people are using these cards:
The cable mentioned on the forum sites is a 2x male 8-pin cable, but since it is a cheap knockoff, it does not work as intended, although the description clearly states that it does.
As for the driver, this is the best I could get after trying many drivers from NVIDIA:
This one seems to give the most detailed error messages:
NVIDIA-Linux-x86_64-460.106.00.run
Then, here are a couple of things you might want to know:
Yes, the Ubuntu kernel and Driver kernel are aligning:
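One way to check that the running kernel and the driver module line up, as a rough sketch: compare uname -r with the vermagic field that modinfo reports for the nvidia module. The helper name vermagic_matches and the version string in the comment are illustrative, not taken from this machine.

```shell
#!/usr/bin/env bash
# Compare the running kernel release with the kernel the nvidia module was built for.
# vermagic looks like "5.15.0-91-generic SMP mod_unload modversions" (example value);
# its first field is the kernel release.
vermagic_matches() {
    local running="$1" vermagic="$2"
    [ "${vermagic%% *}" = "$running" ]
}

running_kernel="$(uname -r)"
module_vermagic="$(modinfo -F vermagic nvidia 2>/dev/null || true)"

if [ -z "$module_vermagic" ]; then
    echo "no nvidia module found for kernel $running_kernel"
elif vermagic_matches "$running_kernel" "$module_vermagic"; then
    echo "nvidia module matches running kernel $running_kernel"
else
    echo "mismatch: kernel $running_kernel, module built for ${module_vermagic%% *}"
fi
```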
The P100 does not seem to run due to power issues, but this is tricky, since I saw on the forums that some incompatible drivers cause the same symptoms.
nvidia-smi
Unable to determine the device handle for GPU 0000:03:00.0: Unable to communicate with GPU because it is insufficiently powered.
This may be because not all required external power cables are
attached, or the attached cables are not seated properly.
I guess it says that the kernel was unable to verify the digital signature of the NVIDIA module. I don’t know if this is due to the card not properly being fed or if I was not able to install the driver right.
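A signature message from the kernel can be a harmless taint warning or, with Secure Boot enabled, a refusal to load the module at all. The following sketch tries to tell the two apart. It assumes mokutil is installed, and the function name check_module_loading is made up for this example.

```shell
#!/usr/bin/env bash
# Look for signs that the nvidia kernel module was rejected or never loaded.
# Assumption: mokutil may or may not be installed, so its absence is handled.
check_module_loading() {
    if command -v mokutil >/dev/null 2>&1; then
        mokutil --sb-state 2>/dev/null || echo "Secure Boot state: unknown"
    else
        echo "mokutil not installed; cannot query Secure Boot state"
    fi

    # /proc/modules lists every loaded module, one per line, name first.
    if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
        echo "nvidia module: loaded"
    else
        echo "nvidia module: not loaded"
    fi
    return 0
}

check_module_loading
# Kernel messages usually need root; look for signature or NVRM lines:
#   sudo dmesg | grep -i -E 'nvidia|nvrm|signature'
```

If Secure Boot is enabled and the module is not loaded, the signature is a likely culprit. If the module is loaded, the power message from nvidia-smi is more likely to be genuine.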
I already enabled “Memory Mapped I/O above 4 GB” in the BIOS settings.
I disabled “Embedded Video Controller” in the BIOS settings.
To rule out a Nouveau driver conflict, I did:
sudo nano /etc/modprobe.d/blacklist-nouveau.conf
and added:
blacklist nouveau
options nouveau modeset=0
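The two directives need to be on separate lines in the file, and the change only takes effect once the initramfs has been rebuilt and the machine rebooted. A small sketch that checks the file contents (blacklist_ok is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Check that a modprobe config file contains both nouveau directives,
# each on its own line.
blacklist_ok() {
    local conf="$1"
    grep -Eq '^[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*$' "$conf" &&
    grep -Eq '^[[:space:]]*options[[:space:]]+nouveau[[:space:]]+modeset=0[[:space:]]*$' "$conf"
}

conf=/etc/modprobe.d/blacklist-nouveau.conf
if [ -r "$conf" ] && blacklist_ok "$conf"; then
    echo "blacklist file looks complete"
else
    echo "blacklist file missing or incomplete: $conf"
fi

# After editing the file, the initramfs has to be rebuilt before rebooting:
#   sudo update-initramfs -u
# After the reboot, this should print nothing if nouveau stayed unloaded:
#   lsmod | grep nouveau
```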
During my previous driver installation tries, I successfully installed Cuda multiple times.
But this time, I haven’t installed it, so I cannot give you the output of nvcc -V
Every time I wanted to re-install the driver, I went back and purged everything related to NVIDIA:
sudo apt-get purge nvidia-*
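To confirm that a purge really removed everything, the remaining packages can be listed from dpkg. This is only a sketch: the helper name installed_nvidia_pkgs is invented here, and the name pattern is a guess at what counts as NVIDIA-related.

```shell
#!/usr/bin/env bash
# Print the names of installed (state "ii") packages that look NVIDIA- or
# CUDA-related. Reads `dpkg -l` style output on stdin.
installed_nvidia_pkgs() {
    awk '$1 == "ii" && $2 ~ /nvidia|cuda/ { print $2 }'
}

if command -v dpkg >/dev/null 2>&1; then
    remaining="$(dpkg -l | installed_nvidia_pkgs)"
    if [ -n "$remaining" ]; then
        printf 'still installed:\n%s\n' "$remaining"
    else
        echo "no NVIDIA or CUDA packages installed"
    fi
fi
```

Two caveats. An unquoted nvidia-* can be expanded by the shell if matching files exist in the current directory, so quoting the pattern is safer. Also, drivers installed from a .run file are not tracked by dpkg, so apt-get purge does not remove them. The .run driver installer normally provides nvidia-uninstall for that.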
I even tried the automatic driver installation as well, but to no avail.
So, I can wholeheartedly say I tried many ways. But I was not able to make it work, so I’m all open to any ideas.
Thank you very much for your time and help in advance!!!
The error message, “Unable to determine the device handle for GPU 0000:03:00.0: Unable to communicate with GPU because it is insufficiently powered.”, can occur when not all 12V pins on the GPU socket are connected.
The wiring diagram (Figure 4 in the P100 PDF you linked) clearly shows that all four pins, 5 to 8, need 12V.
The cable in the eBay link only has 3 pins connected (shown in photo 2) and an extra GND wire fitted. It’s quite possible that this GND is putting a short on the PSU and causing it to shut down the +12V supply to the card. You may want to measure voltages from the back of the GPU connector with it all connected, to confirm.
I offer this advice as is and accept no responsibility for possible outcomes.
It was basically a cable problem.
I ended up buying a lot of cables.
Unfortunately, a lot of sellers advertise cables as if they would work for the R730 with a Tesla P100 16 GB.
NO!!!–They lie about the product. Believe me, I tried!
Beware!!!
I ended up contacting the following eBay seller: professionalrecycling402
This seller has a YouTube channel dedicated to computers and his cats: Computers Cats and More
And here he identifies the riser pin power configuration of Dell R-series servers
He prepared and sold me an 8-pin cable that really works for Dell R-series servers!
Forget the cables on Amazon or other shops on eBay.
Get this one:
https://www.ebay.com/itm/225618308661
Secondly, after the installation of the cable, I had the following error:
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Here is how I solved it:
#get rid of everything nvidia