Accelerator not found: EC2 p2.xlarge, PGI Community Edition

Hi,
I am using the PGI 18.10 Community Edition AMI on AWS EC2 p2.xlarge instances.

When I run pgaccelinfo -v, I get:

CUDA Driver Version:           10000
could not initialize CUDA runtime, error code=100
No accelerators found.
Check the permissions on your CUDA device

I have tried to follow the instructions listed in this thread
https://www.pgroup.com/userforum/viewtopic.php?p=7680&sid=3326ee9998a6dff2e706a7190f348106
modprobe can’t find the nvidia module when I run

modprobe -v nvidia

but I assume that thread is outdated anyway.
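
For reference, the usual checks at this point (standard Linux commands, nothing PGI-specific; the module is assumed to be named nvidia):

modinfo nvidia          # is the module installed for the running kernel at all?
lsmod | grep nvidia     # is it currently loaded?
dmesg | grep -i nvidia  # any driver messages or load errors in the kernel log?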

What I did:
To save some money while installing other dependencies and doing other work, I’ve started the instance several times as t2.micro, which has no accelerator. That worked fine, and after changing the type back to p2.xlarge it always ‘just worked’ - until it didn’t. I’m not sure whether this caused the error, but my guess is that it has nothing to do with it.

lspci shows the K80 to be connected.
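
That is, something like (the exact lspci invocation and filter may differ):

lspci | grep -i nvidia    # the Tesla K80 shows up here if the device is visible on the PCI bus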

Any ideas on how to solve this?

Thanks,
Daniel

Hi Daniel,

I’m thinking this is more of a system issue than a PGI issue, but let’s see if we can diagnose the problem.

could not initialize CUDA runtime, error code=100

Error 100 indicates that there’s no device. Can you try running “nvidia-smi” to see if it recognizes the device?

Perhaps you got a bad node, or Amazon changed the device configuration so that the permissions are incorrect. Either way, you’ll want to contact Amazon.
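
If you want to rule out the permission angle yourself, the device nodes are worth a quick look (a generic check, not specific to PGI or AWS):

ls -l /dev/nvidia*    # the nodes should exist and are typically world read/write (crw-rw-rw-)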

Note that I just logged into a p2.xlarge system and it worked fine, so I suspect the issue is with a particular node.

-Mat

Hi Mat,
thanks for your response.

So, I’ve run nvidia-smi and it just tells me to check the driver:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

I’ve now created a new p2.xlarge instance and attached the volume from the original instance. I’m guessing this should put me on a different node, but it didn’t help.

I’ll contact AWS about this. In the meantime it’s probably best to just start over with a completely fresh instance and image. Still, do you have any idea what might have caused this (assuming it’s not a problem with AWS)? It would be pretty annoying if it happened again.

Daniel

Still, do you have any idea what might have caused this?

Since nvidia-smi failed, my best guess is a bad device, a device that needs to be reset, a permission issue with the device, or a problem with the CUDA driver.
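
If you want to narrow it down while waiting on AWS, a few generic checks plus a reload attempt (module and package names can differ depending on how the driver was installed, so treat this as a sketch):

uname -r                             # kernel the instance is currently running
dkms status                          # was the nvidia module rebuilt for that kernel? (only if DKMS is used)
cat /proc/driver/nvidia/version      # this file only exists while the kernel module is loaded
sudo modprobe nvidia && nvidia-smi   # try loading the module and querying the device again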

I’ll contact AWS about this.

Sounds good. Unfortunately, there’s not much we can do here if there’s a hardware or system issue.

-Mat