The NVIDIA GRID drivers on EC2 Ubuntu machines are constantly lost

We have several EC2 instances, each running Ubuntu 20.04.4 LTS with the NVIDIA GRID driver installed following the AWS instructions. All of the machines have the same problem: things work well for a while, although nvidia-gridd.service still complains about the unset ServerAddress (not necessary according to the AWS instructions when FeatureType is 0), but after a while the driver seems to be lost and nvidia-gridd.service dies with the following messages:

    Failed to initialize RM Client.
    Failed to initialize RM Client.
    Failed to unlock PID file: Bad file descriptor.
    Failed to close PID file: Bad file descriptor.
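For anyone reproducing this, the messages above are what the unit itself reports; they can be pulled out with plain systemd tooling, e.g.:

    systemctl status nvidia-gridd
    sudo journalctl -u nvidia-gridd -b --no-pager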

The temporary workaround we found is to download and install the latest driver all over again. After a fresh installation things work again for a while, but that is not a long-term solution and we cannot take it into a production environment. Any help is greatly appreciated.
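A quick way to tell that the driver has gone away again is whether nvidia-smi still answers; a minimal check (nothing AWS-specific, just the exit code):

    # nvidia-smi exits non-zero once it can no longer talk to the kernel module,
    # which is the same point at which nvidia-gridd starts failing
    if ! nvidia-smi > /dev/null 2>&1; then
        echo "NVIDIA driver is not responding on $(hostname)"
    fi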

Here is what we have tried:

  1. We tried different versions of this driver to see whether it is a compatibility issue, but the result is the same: it works for some time, and then the driver gets lost at some point.

  2. We tried reinstalling with the --dkms option, but the installer rejects the command:

    sudo sh NVIDIA-Linux-x86_64*.run --dkms
    Verifying archive integrity... OK
    Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 510.73.08...
    ./nvidia-installer: invalid option: "NVIDIA-Linux-x86_64-510.85.02-grid-aws.run"

These are the AWS instructions we followed:
sudo apt-get update -y

sudo apt-get upgrade -y linux-aws

sudo apt-get install awscli -y

sudo reboot

sudo apt-get install -y gcc make linux-headers-$(uname -r)

cat << EOF | sudo tee --append /etc/modprobe.d/blacklist.conf
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
EOF

sudo vi /etc/default/grub

------------------------------------------------------------------------------
GRUB_CMDLINE_LINUX="rdblacklist=nouveau"
------------------------------------------------------------------------------

sudo update-grub

aws configure

aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .

aws s3 ls --recursive s3://ec2-linux-nvidia-drivers/

chmod +x NVIDIA-Linux-x86_64*.run

sudo /bin/sh ./NVIDIA-Linux-x86_64*.run

or, for an unattended (silent) install, try:

sudo /bin/sh ./NVIDIA-Linux-x86_64*.run -s

sudo reboot

nvidia-smi -q | head

Activate NVIDIA GRID

sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

sudo nano /etc/nvidia/gridd.conf

------------------------------------------------------------------------------
FeatureType=0
IgnoreSP=TRUE
------------------------------------------------------------------------------

sudo reboot
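After that last reboot, the blacklist and the gridd.conf edits can be sanity-checked with something along these lines (paths exactly as in the steps above):

    # nouveau must not be loaded and rdblacklist should be on the kernel command line
    lsmod | grep nouveau || echo "nouveau not loaded"
    grep -o 'rdblacklist=nouveau' /proc/cmdline
    # the GRID settings should have landed in gridd.conf and the service should be up
    grep -E '^(FeatureType|IgnoreSP)' /etc/nvidia/gridd.conf
    systemctl is-active nvidia-gridd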

Attached: nvidia-bug-report.log.gz (46.1 KB)

You're using a * on the command line, so any options, including --dkms, get lost; and without DKMS the module is not rebuilt when the kernel is upgraded, which is why the driver gets lost.

    sudo sh NVIDIA-Linux-x86_64*.run --dkms
    Verifying archive integrity... OK
    Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 510.73.08...
    ./nvidia-installer: invalid option: "NVIDIA-Linux-x86_64-510.85.02-grid-aws.run"

Please call it by its full name:
sudo sh NVIDIA-Linux-x86_64-510.85.02-grid-aws.run --dkms
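If you want to see what the installer actually receives, let the shell print the expansion; with more than one downloaded .run file in the directory, the extra file name ends up being passed as an "option":

    # shows every file the glob matches; any extra match is handed to
    # nvidia-installer as a bogus argument, which aborts the install
    echo NVIDIA-Linux-x86_64*.run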

Thank you. I hope the --dkms option will make this work long-term :)

You can check with

    dkms status

which should then display the nvidia driver version and your kernel version as "installed".
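If a kernel update ever lands without the module being rebuilt, DKMS can do it by hand; a sketch (module name and version below are the ones from this thread, adjust to whatever dkms status reports):

    # rebuild and install the nvidia module for the running kernel
    sudo dkms install nvidia/510.85.02 -k "$(uname -r)"
    # or simply rebuild everything DKMS manages for this kernel
    sudo dkms autoinstall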

Yes, it does. Thank you very much :)