I am trying to install CUDA and the NVIDIA Container Toolkit on an EC2 RHEL 8.10 instance. All the commands work as expected, however at the end, when I run nvidia-smi, it gives the error "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."
I verified on the Driver Details | NVIDIA page that I am installing the correct driver and CUDA version.
I ran nvidia-bug-report.sh and in the log I see these entries:

NVRM: The NVIDIA GPU 0000:00:1e.0 (PCI ID: 10de:1db1)
NVRM: installed in this system is not supported by open
NVRM: nvidia.ko because it does not include the required GPU
NVRM: System Processor (GSP).
NVRM: Please see the 'Open Linux Kernel Modules' and 'GSP
NVRM: Firmware' sections in the driver README, available on
NVRM: the Linux graphics driver download page at
NVRM: www.nvidia.com.

NVRM: None of the NVIDIA devices were initialized
Can anybody suggest what is missing here?
The instance type is p3.2xlarge
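For reference, 10de:1db1 in that log is the Tesla V100-SXM2 that p3 instances use, and the device plus the installed kernel module flavor can be checked even while the driver refuses to load; a quick sketch:

lspci -nn | grep -i nvidia         # should list the 10de:1db1 device from the NVRM message
modinfo nvidia | grep -i license   # "Dual MIT/GPL" = open kernel module, "NVIDIA" = proprietary module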
For some reason the open-source kernel module variant was installed on your system, either by choosing it during installation or in a previous install. I cannot tell you why AWS does not support it on that instance; that would be a question for AWS customer support.
But you can try installing the closed-source driver instead. It is best to start from a clean system so that no old kernel modules are left behind.
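A minimal sketch of how that switch could look using the dnf module streams from the cuda-rhel8 repo (the stream names are assumptions here; check what dnf module list nvidia-driver actually shows on your system):

sudo dnf module remove --all nvidia-driver          # remove whatever driver stream is currently installed
sudo dnf module reset nvidia-driver                 # clear the enabled stream
sudo dnf module install nvidia-driver:latest-dkms   # proprietary kernel module, built via DKMS for your kernel
sudo reboot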
Thank you Markus, I ran the commands below. Would you mind telling me which of these was the culprit? Also, could GSP be causing it, and would it help if I disable it? (There is a quick check after the command list for which driver flavor got pulled in.)
https://download.nvidia.com/XFree86/Linux-x86_64/510.39.01/README/gsp.html
sudo yum install kernel kernel-tools kernel-headers kernel-devel
sudo reboot
sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel.repo
sudo sed 's/$releasever/8/g' -i /etc/yum.repos.d/epel-modular.repo
sudo yum config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo yum install cuda
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum-config-manager --enable nvidia-container-toolkit-experimental
sudo yum install -y nvidia-container-toolkit
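The check mentioned above: the cuda metapackage installs whatever nvidia-driver stream is enabled or default in the cuda-rhel8 repo, so listing the module streams and installed packages should show which flavor ended up on the box (no driver load needed; package names vary by repo version):

dnf module list nvidia-driver   # shows available streams and which one is default [d] / enabled [e]
rpm -qa | grep -i nvidia        # look for "open" in the kmod/kernel-module package names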
I actually have another EC2 instance, a g4dn.4xlarge running RHEL 8.9, and the same commands work as expected there. nvidia-smi doesn't report any issues, and GSP is enabled as well.
Maybe an AMI-specific setting is causing it.
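If it helps test the AMI theory, the same comparison could be run on both instances; the g4dn's T4 is a Turing GPU, so it works with either module flavor and with GSP, while the p3's V100 needs the proprietary module. A sketch:

uname -r                                  # kernel the modules were built for
dnf module list --enabled nvidia-driver   # which driver stream each instance ended up with
nvidia-smi -q | grep -i "gsp firmware"    # on the working g4dn: shows a GSP firmware version when GSP is in use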