We have several EC2 instances, each running Ubuntu 20.04.4 LTS with NVIDIA GRID drivers installed following the AWS instructions. All the machines have the same problem: things work well for a time, although nvidia-gridd.service still complains about the unset ServerAddress (not necessary according to the AWS instructions when FeatureType is 0). After a while, however, the drivers seem to be lost and nvidia-gridd.service dies with the following messages:
Failed to initialize RM Client.
Failed to initialize RM Client.
Failed to unlock PID file: Bad file descriptor.
Failed to close PID file: Bad file descriptor.
The temporary workaround we found was to download and install the latest driver all over again. After a fresh installation, things work again for some time, but this is not a long-term solution and we cannot take it to a production environment. Any help is greatly appreciated.
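Since the driver only goes missing after a machine has been up for a while, one thing that can be checked at that moment is whether a module built for the running kernel still exists. This is only a sketch of a possible check, not a confirmed cause: it assumes the module file is named nvidia.ko (possibly compressed) somewhere under /lib/modules/<release>, and has_nvidia_module is a hypothetical helper, not part of the driver package.

```shell
# Succeeds if an nvidia kernel module file exists for the given kernel release.
# $1: kernel release (defaults to the running kernel)
# $2: modules root (defaults to /lib/modules; overridable for testing)
has_nvidia_module() {
  release="${1:-$(uname -r)}"
  root="${2:-/lib/modules}"
  # find prints matching paths; grep -q . succeeds only if there is output
  find "$root/$release" -name 'nvidia.ko*' 2>/dev/null | grep -q .
}

if has_nvidia_module; then
  echo "nvidia module present for $(uname -r)"
else
  echo "no nvidia module found for $(uname -r)"
fi
```

If the second message appears on a machine where the driver used to work, comparing `uname -r` with the kernel that was running at install time would show whether the kernel changed in between.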
Here is what we have tried:
- We tried different versions of the driver to see whether it is a compatibility issue, but the result is the same: it works for some time, and then the driver gets lost at some point.
- We tried installing with the --dkms option, but the installer rejected the command:
sudo sh NVIDIA-Linux-x86_64*.run --dkms
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 510.73.08...
./nvidia-installer: invalid option: "NVIDIA-Linux-x86_64-510.85.02-grid-aws.run"
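The invalid option message names a second .run file (510.85.02) while the archive being unpacked is 510.73.08, which suggests the wildcard matched more than one installer and the extra file name was passed to nvidia-installer as an argument. A minimal sketch to check this, assuming the installers sit in the current directory; list_installers is a hypothetical helper name:

```shell
# Print every file the NVIDIA-Linux-x86_64*.run wildcard expands to, one per line.
# $1: directory to look in (defaults to the current directory)
list_installers() {
  dir="${1:-.}"
  for f in "$dir"/NVIDIA-Linux-x86_64*.run; do
    # When nothing matches, the unexpanded pattern is skipped here
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  return 0
}

list_installers .
```

If this prints more than one file, the install command can be rerun with a single explicit file name in place of the wildcard.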
These are the AWS instructions we followed:
sudo apt-get update -y
sudo apt-get upgrade -y linux-aws
sudo apt-get install awscli -y
sudo reboot
sudo apt-get install -y gcc make linux-headers-$(uname -r)
cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
EOF
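Note that tee --append adds the same five lines again each time the step is run, so the file can accumulate duplicates if the procedure is repeated. A sketch of an idempotent variant, shown against a scratch file; add_blacklist is a hypothetical helper, and for the real /etc/modprobe.d/blacklist.conf it would have to run as root:

```shell
# Append "blacklist <module>" lines to a file, skipping any that are already there.
# $1: target file; remaining arguments: module names
add_blacklist() {
  file="$1"
  shift
  for mod in "$@"; do
    # -x matches the whole line, -F treats the pattern as a fixed string
    if ! grep -qxF "blacklist $mod" "$file" 2>/dev/null; then
      printf 'blacklist %s\n' "$mod" >> "$file"
    fi
  done
}

scratch=$(mktemp)
add_blacklist "$scratch" vga16fb nouveau rivafb nvidiafb rivatv
add_blacklist "$scratch" nouveau   # second call adds nothing
cat "$scratch"
```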
sudo vi /etc/default/grub
------------------------------------------------------------------------------------
GRUB_CMDLINE_LINUX="rdblacklist=nouveau"
------------------------------------------------------------------------------------
sudo update-grub
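Before rebooting, the grub file can be checked for the rdblacklist=nouveau parameter inside plain ASCII double quotes (typographic quotes pasted from a web page would not be parsed as quotes). A minimal sketch; grub_has_rdblacklist is a hypothetical helper and its path argument defaults to /etc/default/grub:

```shell
# Succeeds if GRUB_CMDLINE_LINUX is set with ASCII quotes and contains rdblacklist=nouveau.
# $1: grub defaults file (defaults to /etc/default/grub)
grub_has_rdblacklist() {
  file="${1:-/etc/default/grub}"
  grep -Eq '^GRUB_CMDLINE_LINUX="[^"]*rdblacklist=nouveau[^"]*"' "$file" 2>/dev/null
}

if grub_has_rdblacklist; then
  echo "rdblacklist=nouveau is set"
else
  echo "rdblacklist=nouveau not found in GRUB_CMDLINE_LINUX"
fi
```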
aws configure
aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .
aws s3 ls --recursive s3://ec2-linux-nvidia-drivers/
chmod +x NVIDIA-Linux-x86_64*.run
sudo /bin/sh ./NVIDIA-Linux-x86_64*.run
or, for a silent (non-interactive) install:
sudo /bin/sh ./NVIDIA-Linux-x86_64*.run -s
sudo reboot
nvidia-smi -q | head
Activate NVIDIA GRID
sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
sudo nano /etc/nvidia/gridd.conf
------------------------------------------------------------------------------------
FeatureType=0
IgnoreSP=TRUE
------------------------------------------------------------------------------------
sudo reboot
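After the reboot, the values that actually ended up in gridd.conf can be read back, which also helps when comparing a working machine with a broken one. A sketch, assuming key=value lines with optional spaces around the equals sign; gridd_value is a hypothetical helper:

```shell
# Print the value of the last uncommented "key=value" line for a key, if any.
# $1: key name; $2: config file (defaults to /etc/nvidia/gridd.conf)
gridd_value() {
  key="$1"
  file="${2:-/etc/nvidia/gridd.conf}"
  # Lines starting with '#' do not match, so commented-out defaults are ignored
  sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" "$file" 2>/dev/null | tail -n 1
}

echo "FeatureType=$(gridd_value FeatureType)"
echo "IgnoreSP=$(gridd_value IgnoreSP)"
```

An empty value means the key is missing or still commented out in the file.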
nvidia-bug-report.log.gz (46.1 KB)