Hello, I hope someone can help me. I have spent several days on this issue with no luck. I am trying to install the CUDA Toolkit for a Tesla P100 GPU. My hardware is as follows:
Motherboard: Tyan S8225 (S8225AGM4NRF)
CPU: AMD Opteron 4284 (x2)
RAM: 128 GB (16x8) ECC DDR3 1333 MHz
The motherboard works great in Ubuntu with GeForce graphics cards (I’ve tested a 2060, 3060 Ti, 3060 12GB, and 2080 Ti). I’ve also tested the 2080 Ti in CentOS, and it works fine as well. I never had an issue installing the CUDA Toolkit or using TensorFlow, etc. But for some reason, I absolutely cannot get a P100 to communicate with nvidia-smi.
I’ve tried CentOS Stream 9, CentOS Stream 8, and now I’m on CentOS 7.
I feel like I must be doing something wrong or missing something. I am using the CentOS 7 “workstation” install with all the optional dependencies. I have Secure Boot disabled and have verified that gcc 4 is installed. SSH is enabled.
lspci | grep -i nvidia returns
06:00.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
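Since the card is visible on the PCI bus, a useful next check is which kernel driver, if any, is bound to it. The following is a minimal sketch, assuming the usual `lspci -k` output with a "Kernel driver in use:" line; `bound_driver` is a hypothetical helper name, not part of any NVIDIA tool.

```shell
#!/bin/bash
# bound_driver: hypothetical helper. Reads `lspci -k` output on stdin and
# prints the driver named on the "Kernel driver in use:" line, or "none"
# if no such line is present (no driver is bound to the device).
bound_driver() {
    awk -F': ' '/Kernel driver in use:/ { print $2; found = 1; exit }
                END { if (!found) print "none" }'
}
```

Running `lspci -k -s 06:00.0 | bound_driver` should print `nvidia` when the proprietary driver has claimed the card, `nouveau` if the blacklist did not take effect, or `none` if nothing is bound.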
After I boot up the system, I blacklist nouveau with:
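#!/bin/bash
# Must run as root: this writes to /etc/modprobe.d and /boot.
if [[ $EUID -ne 0 ]]; then
    echo "This script must be run as root." >&2
    exit 1
fi
# Stop the nouveau driver from loading.
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
# Back up the current initramfs, then rebuild it for the running kernel.
mv "/boot/initramfs-$(uname -r).img" "/boot/initramfs-$(uname -r).img.bak"
dracut -v "/boot/initramfs-$(uname -r).img" "$(uname -r)"
reboot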
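After the reboot, it may be worth confirming that the blacklist actually took effect. A minimal sketch, assuming standard `lsmod` output with the module name in the first column; `nouveau_loaded` is a hypothetical helper name.

```shell
#!/bin/bash
# nouveau_loaded: hypothetical helper. Reads `lsmod` output on stdin and
# succeeds (exit status 0) only if a module named exactly "nouveau" appears
# in the first column. Modules that merely list nouveau as a user do not count.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}
```

For example, `if lsmod | nouveau_loaded; then echo "nouveau is still loaded"; fi` prints a warning only when the module is present.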
I then switch to runlevel 3:
sudo init 3
And install cuda with:
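#!/bin/bash
# Bring the system up to date first.
sudo yum update -y
# Headers and sources for the running kernel, needed to build the driver module.
sudo yum install -y "kernel-devel-$(uname -r)" "kernel-headers-$(uname -r)"
# EPEL provides dkms.
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# Driver (DKMS build), then the CUDA toolkit and driver metapackages.
sudo yum install -y nvidia-driver-latest-dkms.x86_64
sudo yum install -y cuda
sudo yum install -y cuda-drivers
sudo reboot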
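Because `yum update` can install a newer kernel that only becomes the running kernel after the reboot, one thing that could be checked afterwards is whether a `kernel-devel` package matching the running kernel is installed, since DKMS needs it to build the module. A minimal sketch, assuming `rpm -q kernel-devel` prints one `kernel-devel-<version>` line per installed package; `has_matching_devel` is a hypothetical helper name.

```shell
#!/bin/bash
# has_matching_devel: hypothetical helper. Takes the running kernel release
# (as printed by `uname -r`) as $1, reads `rpm -q kernel-devel` output on
# stdin, and succeeds only if a line equals "kernel-devel-<release>" exactly.
has_matching_devel() {
    local running="$1"
    grep -qxF "kernel-devel-${running}"
}
```

Usage would look like `rpm -q kernel-devel | has_matching_devel "$(uname -r)" || echo "no kernel-devel for the running kernel"`.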
I verify that the driver install worked by reading /var/log/nvidiainstaller and running
nvcc --version
but for some reason,
nvidia-smi
returns: No devices found.
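When nvidia-smi reports no devices, the kernel log is a reasonable next place to look, because the NVIDIA kernel module prefixes its messages with `NVRM`. A minimal sketch that filters those lines out of `dmesg` output; `nvrm_lines` is a hypothetical helper name, and what the lines say on this particular machine is not known here.

```shell
#!/bin/bash
# nvrm_lines: hypothetical helper. Reads kernel log text on stdin and prints
# only the lines that contain "NVRM". Prints nothing, and still exits 0,
# when there are no such lines.
nvrm_lines() {
    grep -F 'NVRM' || true
}
```

Running `sudo dmesg | nvrm_lines` and including the output in the post would show whether the module loaded and whether it reported a problem while initializing the card.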
Does anyone have any advice? I’m using the network install because the runfile gives me an error during the NVIDIA driver install: missing kernel module: “nvidia.ko”. I’ve looked elsewhere, but I cannot find much documentation about the P100.
I’m trying to get these cards working to train deep learning models for medical research.
Thanks in advance