EC2 Ubuntu 18.04 LTS P3.8xlarge CUDA install with Tesla V100 `nvidia-smi` fails, drivers cannot install as no recognised device exists

Hello,

I am trying to install CUDA on an EC2 p3.8xlarge instance running Ubuntu 18.04 LTS. I followed the instructions Amazon has laid out and, when those didn’t work, other guides I found.

I cannot get this to work, and I’ve spent about 8 hours on it so far.

Whenever I get to nvidia-smi, I get: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Whenever I try to install the driver from a .run file, using the 8-bit interface, a warning comes up first: WARNING: You do not appear to have an NVIDIA GPU supported by the 410.79 NVIDIA Linux graphics driver installed in this system. For further details, please see the appendix SUPPORTED NVIDIA GRAPHICS CHIPS in the README available on the Linux driver download page at www.nvidia.com.

How can I install CUDA? This is driving me mad.

AWS provides instructions here:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html

nvidia-smi can fail like this if old driver software has not been properly cleaned out. Whether that applies may depend on the AMI you are starting with. The above instructions give suggestions for how to clean out old driver installs in some situations (i.e. for some starting AMIs). Otherwise you can follow the “handle conflicting installs” information in the NVIDIA CUDA Linux installation guide:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#handle-uninstallation
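
Before removing anything, it can help to see which driver or CUDA packages are already present. Here is a minimal Python sketch, assuming a Debian-style system where `dpkg-query` is available. The helper names and the keyword list are illustrative assumptions, not something taken from either guide:

```python
import subprocess
from typing import List

# Substrings that suggest a package belongs to an NVIDIA driver or CUDA install.
# This keyword list is an assumption for illustration, not an official one.
KEYWORDS = ("nvidia", "cuda", "libcudnn")


def filter_gpu_packages(dpkg_listing: str) -> List[str]:
    """Keep the 'package version' lines whose package name matches KEYWORDS."""
    matches = []
    for line in dpkg_listing.splitlines():
        fields = line.split()
        if fields and any(key in fields[0].lower() for key in KEYWORDS):
            matches.append(line.strip())
    return matches


def list_gpu_packages() -> List[str]:
    """Ask dpkg-query for every package it knows about and filter the result."""
    result = subprocess.run(
        ["dpkg-query", "-W", "-f=${Package} ${Version}\n"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return filter_gpu_packages(result.stdout)
```

Calling `list_gpu_packages()` on the instance returns whatever matching packages dpkg knows about. A non-empty list on a supposedly fresh AMI would suggest leftovers worth cleaning out first.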

I’m not sure what you mean by the “8-bit interface”. I guess you are referring to the character mode of the runfile driver installer. I’m not sure how to interpret that message. You may wish to confirm that there are NVIDIA GPUs in your instance, e.g. with the lspci command.
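
That check can also be run from a script rather than by eye. The Python sketch below filters `lspci -nn` output for lines that name NVIDIA or carry NVIDIA's PCI vendor ID, 10de. The function names are made up for illustration, and it assumes `lspci` (from pciutils) is on the PATH:

```python
import subprocess
from typing import List

NVIDIA_VENDOR_ID = "10de"  # PCI vendor ID assigned to NVIDIA


def find_nvidia_devices(lspci_output: str) -> List[str]:
    """Return the lspci lines that mention NVIDIA by name or by vendor ID."""
    hits = []
    for line in lspci_output.splitlines():
        lowered = line.lower()
        # `lspci -nn` prints IDs as [vendor:device], e.g. [10de:xxxx].
        if "nvidia" in lowered or "[{}:".format(NVIDIA_VENDOR_ID) in lowered:
            hits.append(line.strip())
    return hits


def scan_this_machine() -> List[str]:
    """Run `lspci -nn` (names plus numeric IDs) and filter its output."""
    result = subprocess.run(
        ["lspci", "-nn"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return find_nvidia_devices(result.stdout)
```

If `scan_this_machine()` returns an empty list, the instance is not exposing any NVIDIA PCI device, and the driver installer would be expected to complain that no supported GPU is present.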

Hmm, well I suppose we should deal with the issue that seems most basic first.

Running lspci gives me this:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

That “8-bit” interface is the one that shows up when I run:

sudo /bin/sh ./NVIDIA-Linux-x86_64-410.79.run

This is an EC2 p3.8xlarge, so I don’t understand why there are no GPUs listed there.

If there are no GPUs listed by lspci in your instance, then it’s expected that the driver installer will report: WARNING: You do not appear to have an NVIDIA GPU supported by the 410.79

I don’t find any similar reports on the web, so my guess is that you are not actually connecting to a p3.8xlarge instance.
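
To rule that out from inside the instance, you could ask the EC2 instance metadata service what type it reports. Here is a minimal Python sketch, assuming the token-based (IMDSv2) metadata flow is enabled on the instance. The function names are hypothetical, and the fetch only works when run on the EC2 instance itself:

```python
import urllib.request

# Link-local address of the EC2 instance metadata service.
METADATA_BASE = "http://169.254.169.254/latest"


def fetch_instance_type(timeout: float = 2.0) -> str:
    """Ask the instance metadata service which instance type this machine is."""
    # IMDSv2: obtain a short-lived session token first, then send it with the query.
    token_request = urllib.request.Request(
        METADATA_BASE + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()

    type_request = urllib.request.Request(
        METADATA_BASE + "/meta-data/instance-type",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(type_request, timeout=timeout) as response:
        return response.read().decode().strip()


def matches_expected(actual: str, expected: str = "p3.8xlarge") -> bool:
    """Compare instance type strings, ignoring case and surrounding whitespace."""
    return actual.strip().lower() == expected.strip().lower()
```

On the machine you have ssh'd into, `matches_expected(fetch_instance_type())` should be true if it really is a p3.8xlarge. Any other reported type would explain why lspci shows no NVIDIA device.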

I can’t see how I am not. I create the spot request, wait for it to be fulfilled, go to running instances, and then connect to it via the Connect button by pasting in the ssh command it gives. That’s bizarre…

You may wish to check your spot instance limits, or contact AWS for help.

I’m sorry to ask this of you but where can I find what my spot instance limits are?

If you simply google “aws spot instance limits” you’ll find many good resources. Here is one:

https://stackoverflow.com/questions/43217134/aws-spot-instance-limit

I’m not suggesting this is definitely the issue. I don’t think it is. I think there is something else going on, not sure what, and that AWS support might be your quickest resolution.