EC2 Ubuntu 18.04 LTS P3.8xlarge CUDA install with Tesla V100 `nvidia-smi` fails, drivers cannot install as no recognised device exists

thetrollwrangler · January 14, 2019, 4:06pm

Hello,

I am trying to install CUDA on an EC2 P3.8xlarge Ubuntu 18.04 LTS instance. I followed the instructions Amazon has laid out and, when those didn’t work, several other guides.

I cannot get this to work, and I’ve spent about 8 hours on it so far.

Whenever I get to `nvidia-smi`, I get: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."

Whenever I try to install the driver from a .run file through the 8-bit interface, a warning comes up first: "WARNING: You do not appear to have an NVIDIA GPU supported by the 410.79 NVIDIA Linux graphics driver installed in this system. For further details, please see the appendix SUPPORTED NVIDIA GRAPHICS CHIPS in the README available on the Linux driver download page at www.nvidia.com."

How can I install CUDA? This is driving me mad.

Robert_Crovella · January 14, 2019, 4:16pm

AWS provides instructions here:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html

The problem with nvidia-smi not working may occur if old driver software has not been properly cleaned out. Whether that applies depends on the AMI you are starting with. The above instructions suggest how to clean out old driver installs in some situations (i.e. for some starting AMIs). Otherwise you can follow the “handle conflicting installs” information in the NVIDIA CUDA Linux install guide:

Installation Guide Linux :: CUDA Toolkit Documentation
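Before cleaning anything out, it can help to see which NVIDIA or CUDA packages the AMI already carries. The following is a minimal sketch, assuming a Debian/Ubuntu system where `dpkg -l` is available. The helper names and the sample rows (package versions included) are made up for illustration and do not come from the instance discussed in this thread.

```python
import subprocess


def installed_nvidia_packages(dpkg_text):
    """Return (status, name, version) for NVIDIA/CUDA rows in `dpkg -l` output."""
    found = []
    for line in dpkg_text.splitlines():
        fields = line.split()
        # Package rows start with a short status code such as "ii" or "rc";
        # the header rows ("Desired=...", "||/ Name", "+++-===") do not.
        if len(fields) < 3 or not fields[0].isalpha() or len(fields[0]) > 3:
            continue
        status, name, version = fields[0], fields[1], fields[2]
        if "nvidia" in name or name.startswith("cuda"):
            found.append((status, name, version))
    return found


def read_dpkg_list():
    """Run `dpkg -l` and return its text output (Debian/Ubuntu only)."""
    result = subprocess.run(["dpkg", "-l"], capture_output=True, text=True, check=True)
    return result.stdout


# Illustrative sample only: these rows are invented for the example.
SAMPLE = """\
||/ Name              Version          Architecture Description
+++-=================-================-============-====================
ii  bash              4.4.18           amd64        GNU Bourne Again SHell
ii  nvidia-driver-410 410.79-0ubuntu1  amd64        NVIDIA driver metapackage
rc  cuda-drivers      410.79-1         amd64        CUDA Driver meta-package
"""

for status, name, version in installed_nvidia_packages(SAMPLE):
    print(status, name, version)
```

On the instance itself, `installed_nvidia_packages(read_dpkg_list())` would give the real list. A status of `rc` means the package was removed but its configuration files remain. Drivers installed from a .run file are not tracked by dpkg, so an empty list does not rule out a leftover runfile install.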

I’m not sure what you mean by the “8-bit interface”. I guess you are referring to the character mode of the runfile driver installer. I’m not sure how to interpret that message. You may wish to confirm that there are NVIDIA GPUs in your instance, e.g. with the lspci command.

thetrollwrangler · January 14, 2019, 4:22pm

Hmm, well I suppose we should deal with the one that seems like the most base-level first.

Running lspci gives me this:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

That “8-bit” interface is the one that shows up when I run

sudo /bin/sh ./NVIDIA-Linux-x86_64-410.79.run

This is an EC2 p3.8xlarge, so I don’t understand why there are no GPUs listed there.

Robert_Crovella · January 14, 2019, 4:28pm

If there are no GPUs listed by lspci in your instance, then it’s expected that the driver installer will report: WARNING: You do not appear to have an NVIDIA GPU supported by the 410.79
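The lspci check can be scripted so that it gives a yes/no answer before any driver install is attempted. This is a minimal sketch, assuming NVIDIA devices show up with "NVIDIA" in their lspci description, which is how lspci normally names them. The helper names are hypothetical, and the sample text is three of the lines posted earlier in this thread.

```python
import subprocess


def nvidia_devices(lspci_text):
    """Return the lines of lspci output that mention an NVIDIA device."""
    return [line for line in lspci_text.splitlines() if "nvidia" in line.lower()]


def read_lspci():
    """Run lspci and return its text output (needs the pciutils package)."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return result.stdout


# Three of the lines posted above; none of them is an NVIDIA device.
POSTED = """\
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
"""

if not nvidia_devices(POSTED):
    print("no NVIDIA device visible on the PCI bus")
```

On a live instance, `nvidia_devices(read_lspci())` performs the same check against the real bus. If it returns an empty list, the driver installer has nothing to bind to, and the problem lies with the instance rather than with the driver package.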

I don’t find any similar reports on the web, so my guess is that you are not actually connecting to a p3.8xlarge instance.
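One way to check which instance type a shell session has actually landed on is to ask the EC2 instance metadata service from inside the instance. The sketch below assumes the metadata endpoint at 169.254.169.254 is reachable and that token-based (IMDSv2) requests are accepted. The helper names `imds_get` and `check_instance_type` are hypothetical.

```python
import urllib.request

IMDS = "http://169.254.169.254/latest"


def imds_get(path, timeout=2):
    """Fetch one metadata value using an IMDSv2 session token."""
    token_request = urllib.request.Request(
        IMDS + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_request, timeout=timeout).read().decode()
    data_request = urllib.request.Request(
        IMDS + "/meta-data/" + path,
        headers={"X-aws-ec2-metadata-token": token},
    )
    return urllib.request.urlopen(data_request, timeout=timeout).read().decode()


def check_instance_type(expected, fetch=imds_get):
    """Return (matches, actual) for the instance type reported by the metadata service."""
    actual = fetch("instance-type").strip()
    return actual == expected, actual
```

Calling `check_instance_type("p3.8xlarge")` over the SSH session returns the type the instance itself reports, which can be compared with what the spot request asked for. The `fetch` parameter exists only so the comparison can be exercised without network access.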

thetrollwrangler · January 14, 2019, 4:35pm

I can’t see how I am not. I create the spot request, wait for it to be fulfilled, go to running instances, and then connect via the connect button by pasting in the ssh command it gives. That’s bizarre…

Robert_Crovella · January 14, 2019, 4:41pm

You may wish to check your spot instance limits, or contact AWS for help.

thetrollwrangler · January 14, 2019, 4:44pm

I’m sorry to ask this of you but where can I find what my spot instance limits are?

Robert_Crovella · January 14, 2019, 5:18pm

If you simply google “aws spot instance limits” you’ll find many good resources. Here is one:

amazon web services - AWS Spot Instance Limit - Stack Overflow

I’m not suggesting this is definitely the issue; I don’t think it is. I think something else is going on, though I’m not sure what, and AWS support might be your quickest route to a resolution.
