System hangs when issuing "nvidia-smi -q" command, after installing CUDA

I have a GPU VM instance created in Amazon AWS EC2 cloud. I followed these instructions for installing the latest Nvidia driver:

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using_cluster_computing.html

Using this selection for Amazon G2:
Product Type: GRID
Product Series: GRID Series
Product: GRID K520
Operating System: Linux 64-bit
Recommended/Beta: Recommended/Certified

Then I installed CUDA 7.5:
https://developer.nvidia.com/cuda-downloads

Linux x86_64 CentOS 7 rpm(local)
cuda-repo-rhel7-7-5-local-7.5-18.x86_64.rpm
The installation reported no errors.

After installation, I ran:
$ nvidia-smi -q
This hung the VM immediately. After a hard reset via the Amazon Management console, the VM came back up. I tried again, and it would still hang.
I uninstalled the Nvidia drivers and CUDA, and decided to try again.

This time, after installing Nvidia drivers, I ran the ‘nvidia-smi -q’ command. No problem. It returned the results quickly, no hung VM.
I then installed CUDA 7.0, thinking maybe 7.5 was the problem. After installing CUDA 7.0, I ran the ‘nvidia-smi -q’ command. This hung the VM.
So, it would appear there is some problem with the CUDA installation.

I read a post somewhere about “GPU Persistence Mode”, and tried:

nvidia-smi -pm 1

After setting this, the ‘nvidia-smi -q’ command no longer hung the VM, but the command itself never returned any output and I couldn’t kill the process. I could still ssh into the VM from another terminal without any issue and look for errors, but didn’t find any.
I’m not sure if this ‘fix’ is relevant; there still seems to be a problem since the command doesn’t return any output.

Is there anything else I can try, or anywhere else I should look for problems?

Any help/tips would be appreciated.

Thanks!

You’ll need to update to a newer driver than what is in the CUDA 7.5 package.

https://devtalk.nvidia.com/default/topic/880246/cuda-setup-and-installation/cuda-7-5-unstable-on-ec2-/

I don’t understand the part about ‘a newer driver than what is in the CUDA 7.5 package’. I manually installed the Nvidia driver, then installed CUDA 7.5. Is this not correct?

I just tried replacing my Nvidia 367.27 driver with the version listed at the end of the thread you linked: 361.45.11

The problem remains. After installing the Nvidia driver, I can run:

nvidia-smi -q

But if I then proceed to install CUDA 7.5, the same ‘nvidia-smi -q’ command hangs.

Any other ideas? Logs I can provide?

Thanks!

Install CUDA 7.5. Then install the 361.45 driver (367.27 should also work).

Installing CUDA 7.5 after installing the driver wipes out the driver and replaces it with an older one.

The package manager method can make this difficult.

My suggestion would be:

  1. Start over with a clean OS load.
  2. Follow the instructions for the “runfile installer method” in the CUDA 7.5 Linux installation guide. Follow them carefully: don’t skip any steps, and don’t forget to remove nouveau.
  3. Select “no” when prompted to install the driver after launching the CUDA 7.5 runfile installer.
  4. Download the 361.45 driver runfile installer.
  5. Install that driver.
  6. Profit.
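
For anyone who prefers to script steps 3 to 5, here is a rough, non-interactive sketch. It assumes the runfile names cuda_7.5.18_linux.run and NVIDIA-Linux-x86_64-361.45.11.run (adjust these to whatever you actually downloaded), and it assumes that passing --silent --toolkit without --driver to the CUDA runfile is equivalent to answering "no" at the driver prompt. The run and install_cuda_then_driver helpers are made up for illustration. It does not cover the nouveau removal or the other prerequisites from the install guide, and by default it only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only: toolkit without the bundled driver, then the newer driver.
# Both file names are assumptions; point them at your own downloads.
CUDA_RUNFILE="${CUDA_RUNFILE:-cuda_7.5.18_linux.run}"
DRIVER_RUNFILE="${DRIVER_RUNFILE:-NVIDIA-Linux-x86_64-361.45.11.run}"

# Hypothetical helper: prints each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

install_cuda_then_driver() {
  # Step 3: install the toolkit only, so the bundled driver is skipped.
  run sudo sh "$CUDA_RUNFILE" --silent --toolkit
  # Steps 4-5: install the newer driver from its own runfile.
  run sudo sh "$DRIVER_RUNFILE" --silent
  # Ask nvidia-smi which driver version is now in use.
  run nvidia-smi --query-gpu=driver_version --format=csv,noheader
}

install_cuda_then_driver
```

Run it once as-is to review the printed commands, then again with DRY_RUN=0 to actually execute them.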

That process worked. Thanks for the help!

The part of the process which was giving me grief is here:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using_cluster_computing.html

Where it states:
You must reinstall the CUDA toolkit after installing the NVIDIA driver.

Which was blowing away the original Nvidia driver again.

Normally, that works. But there are two additional considerations in this case:

  1. The driver bundled with the CUDA 7.5 runfile installer (352.39) does not work correctly on EC2 instances - as indicated in the other thread I linked. You have to use a newer driver.
  2. If using the runfile installer method, the actual order does not really matter (driver, then toolkit, or toolkit, then driver) as long as you deselect the option to install the driver when installing the toolkit. In other words, in my instruction sequence, you could have swapped steps 4 and 5 with step 3 (thus effectively matching the AWS instructions) as long as you deselect the driver install as part of the toolkit install process.
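
A quick way to check whether a toolkit install has put the older driver back, without calling nvidia-smi (which is the command that hangs), is to compare the installed kernel module version against 352.39. This is only a sketch: version_gt is a made-up helper, it relies on GNU sort -V, and it assumes that modinfo can see the nvidia module on your system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: true when version $1 is strictly newer than $2.
# Relies on GNU sort -V for version ordering.
version_gt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Driver version bundled with the CUDA 7.5 runfile, per this thread.
BUNDLED="352.39"

# Read the version from the kernel module instead of calling nvidia-smi.
installed="$(modinfo -F version nvidia 2>/dev/null || true)"

if [ -z "$installed" ]; then
  echo "no nvidia kernel module found"
elif version_gt "$installed" "$BUNDLED"; then
  echo "driver $installed is newer than $BUNDLED"
else
  echo "driver $installed is not newer than $BUNDLED; reinstall the newer driver"
fi
```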

Again, this doesn’t describe how to perform a similar operation using the package manager method. That exercise is left to the reader, given the above understanding of the issue.