I am trying to install CUDA and the NVIDIA Container Toolkit on an EC2 RHEL 8.10 instance. All the commands work as expected, however at the end, when I run nvidia-smi, it gives the error "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."
I verified on the Driver Details | NVIDIA page that I am installing the correct driver and CUDA version.
I ran nvidia-bug-report.sh and in the log I see these entries:

NVRM: The NVIDIA GPU 0000:00:1e.0 (PCI ID: 10de:1db1)
NVRM: installed in this system is not supported by open
NVRM: nvidia.ko because it does not include the required GPU
NVRM: System Processor (GSP).
NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
NVRM: Firmware' sections in the driver README, available on
NVRM: the Linux graphics driver download page at
NVRM: www.nvidia.com.

NVRM: None of the NVIDIA devices were initialized
Can anybody suggest what is missing here?
The instance type is p3.2xlarge
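For reference, 10de:1db1 in that log is the Tesla V100-SXM2 that p3 instances use, and the device plus the installed kernel module flavor can be checked even while the driver refuses to load; a quick sketch:

lspci -nn | grep -i nvidia         # should list the 10de:1db1 device from the NVRM message
modinfo nvidia | grep -i license   # "Dual MIT/GPL" = open kernel module, "NVIDIA" = proprietary module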
For some reason the open-source kernel module variant was installed on your system, either by choosing it during installation or in a previous install. I cannot tell you why AWS does not support it on that instance; that would be a question for AWS customer support.
But you can try installing the closed-source driver instead. It is best to start from a clean system so that no old kernel modules are left behind.
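A minimal sketch of how that switch could look using the dnf module streams from the cuda-rhel8 repo (the stream names are assumptions here; check what dnf module list nvidia-driver actually shows on your system):

sudo dnf module remove --all nvidia-driver          # remove whatever driver stream is currently installed
sudo dnf module reset nvidia-driver                 # clear the enabled stream
sudo dnf module install nvidia-driver:latest-dkms   # proprietary kernel module, built via DKMS for your kernel
sudo reboot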
Thank you Markus, I ran the commands below. Would you mind telling me which of these was the culprit? Also, could GSP be causing it, and would it help if I disable it? (There is a quick check after the command list for which driver flavor got pulled in.)
https://download.nvidia.com/XFree86/Linux-x86_64/510.39.01/README/gsp.html
sudo yum install kernel kernel-tools kernel-headers kernel-devel
sudo reboot
sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel.repo
sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel-modular.repo
sudo yum config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo yum install cuda
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum-config-manager --enable nvidia-container-toolkit-experimental
sudo yum install -y nvidia-container-toolkit
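The check mentioned above: the cuda metapackage installs whatever nvidia-driver stream is enabled or default in the cuda-rhel8 repo, so listing the module streams and installed packages should show which flavor ended up on the box (no driver load needed; package names vary by repo version):

dnf module list nvidia-driver   # shows available streams and which one is default [d] / enabled [e]
rpm -qa | grep -i nvidia        # look for "open" in the kmod/kernel-module package names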
I actually have another EC2 instance, a g4dn.4xlarge running RHEL 8.9, and the same commands work as expected there. nvidia-smi doesn't report any issues, and GSP is enabled as well.
Maybe an AMI-specific setting is causing it.
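If it helps test the AMI theory, the same comparison could be run on both instances; the g4dn's T4 is a Turing GPU, so it works with either module flavor and with GSP, while the p3's V100 needs the proprietary module. A sketch:

uname -r                                  # kernel the modules were built for
dnf module list --enabled nvidia-driver   # which driver stream each instance ended up with
nvidia-smi -q | grep -i "gsp firmware"    # on the working g4dn: shows a GSP firmware version when GSP is in use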