System hangs when issuing "nvidia-smi -q" command, after installing CUDA

joseph85750 · June 16, 2016, 3:06pm

I have a GPU VM instance created in Amazon AWS EC2 cloud. I followed these instructions for installing the latest Nvidia driver:

Using this selection for Amazon G2:
Product Type: GRID
Product Series: GRID Series
Product: GRID K520
Operating System: Linux 64-bit
Recommended/Beta: Recommended/Certified

Then I installed CUDA 7.5:
CUDA Toolkit 11.7 Update 1 Downloads | NVIDIA Developer

Linux x86_64 CentOS 7 rpm(local)
cuda-repo-rhel7-7-5-local-7.5-18.x86_64.rpm
No errors.

After installation, I ran:
$ nvidia-smi -q
This hung the VM immediately. After a hard reset via the Amazon Management console, the VM came back up. I tried again, and it hung again.
I uninstalled the Nvidia drivers and CUDA, and decided to try again.

This time, after installing Nvidia drivers, I ran the ‘nvidia-smi -q’ command. No problem. It returned the results quickly, no hung VM.
I then installed CUDA 7.0, thinking maybe 7.5 was the problem. After installing CUDA 7.0, I ran the ‘nvidia-smi -q’ command. This hung the VM.
So, it would appear there is some problem with the CUDA installation.

I read a post somewhere about “GPU Persistence Mode”, and tried:

nvidia-smi -pm 1

After setting this, the ‘nvidia-smi -q’ command wouldn’t hang the VM, but the command itself would never return an output and I couldn’t kill the process. I could ssh into the VM from another terminal without any issue, and view any errors, but didn’t see any.
I’m not sure if this ‘fix’ is relevant; there still seems to be a problem since the command doesn’t return any output.
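A process that cannot be killed is often stuck in uninterruptible sleep (state D in ps) inside a kernel or driver call. That is only one possible explanation for the behavior described here. The sketch below is not from the thread; filter_d_state and list_stuck_processes are hypothetical helper names, and it assumes a procps-style ps.

```shell
#!/usr/bin/env bash
# Filter "PID STAT COMMAND" lines on stdin and keep only processes whose
# state starts with D (uninterruptible sleep). Such processes do not act on
# signals, SIGKILL included, until the blocking kernel call returns.
filter_d_state() {
    awk '$2 ~ /^D/ { print }'
}

# Hypothetical helper: apply the filter to the live process table.
list_stuck_processes() {
    ps -eo pid=,stat=,comm= | filter_d_state
}
```

Running list_stuck_processes from a second ssh session while ‘nvidia-smi -q’ is stuck would show whether it is in state D. The NVIDIA kernel module prefixes its kernel log messages with NVRM, so `dmesg | grep -i nvrm` is another place to look for errors.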

Is there anything else I can try, or anywhere else I should look for problems?

Any help/tips would be appreciated.

Thanks!

Robert_Crovella · June 17, 2016, 2:37am

You’ll need to update to a newer driver than what is in the CUDA 7.5 package.

https://devtalk.nvidia.com/default/topic/880246/cuda-setup-and-installation/cuda-7-5-unstable-on-ec2-/

joseph85750 · June 28, 2016, 11:15pm

I don’t understand the part about ‘the Nvidia driver that is in the CUDA 7.5 package’. I manually installed the Nvidia driver, then installed CUDA 7.5. Is this not correct?

I just now tried replacing my Nvidia 367.27 driver with the version listed at the end of the thread you provided: 361.45.11

The problem remains. After installing the Nvidia driver, I can run:

nvidia-smi -q

But if I then proceed to install CUDA 7.5, the same ‘nvidia-smi -q’ command hangs.

Any other ideas? Logs I can provide?

Thanks!

Robert_Crovella · June 29, 2016, 1:09am

Install CUDA 7.5. Then install the 361.45 driver (367.27 should also work).

Installing CUDA 7.5 after installing the driver wipes out the driver and replaces it with an older one.

The package manager method can make this difficult.

My suggestion would be:

1. Start over with a clean OS load.
2. Follow the instructions for “runfile installer method” in the CUDA 7.5 Linux install guide. Don’t skip any steps or fail to remove nouveau. Follow the instructions carefully.
3. Select “no” when prompted to install the driver after launching the CUDA 7.5 runfile installer.
4. Download the 361.45 driver runfile installer.
5. Install that driver.
6. Profit.
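The toolkit-then-driver part of the sequence above can be sketched as a dry-run script. This is a sketch under assumptions, not the exact commands used in the thread. The two runfile names are guesses at the download names. The --silent --toolkit options are the CUDA runfile's documented silent-mode switches, used here in place of answering “no” at the interactive driver prompt. It also assumes nouveau has already been disabled and the machine rebooted, as the install guide requires. The run and install_sequence names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumed file names; override them to match what you actually downloaded.
CUDA_RUNFILE="${CUDA_RUNFILE:-cuda_7.5.18_linux.run}"
DRIVER_RUNFILE="${DRIVER_RUNFILE:-NVIDIA-Linux-x86_64-361.45.11.run}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print a command, and execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

install_sequence() {
    # Toolkit only: in silent mode the bundled driver is installed only if
    # the driver option is also given, so it is deliberately left out here.
    run sudo sh "$CUDA_RUNFILE" --silent --toolkit || return 1
    # Then the newer driver from its own runfile installer.
    run sudo sh "$DRIVER_RUNFILE" || return 1
    # Finally, the command that used to hang.
    run nvidia-smi -q
}
```

Calling install_sequence with the default DRY_RUN=1 only prints the three commands, which makes it easy to review the order before running anything with DRY_RUN=0.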

joseph85750 · July 1, 2016, 3:08pm

That process worked. Thanks for the help!

The part of the process which was giving me grief is here:
Linux accelerated computing instances - Amazon Elastic Compute Cloud

Where it states:
“You must reinstall the CUDA toolkit after installing the NVIDIA driver.”

Which was blowing away the original Nvidia driver again.

Robert_Crovella · July 1, 2016, 3:39pm

Normally, that works. But there are two additional considerations in this case:

1. The driver bundled with the CUDA 7.5 runfile installer (352.39) does not work correctly on EC2 instances, as indicated in the other thread I linked. You have to use a newer driver.
2. If using the runfile installer method, the actual order does not really matter (driver, then toolkit, or toolkit, then driver) as long as you deselect the option to install the driver when installing the toolkit. In other words, in my instruction sequence, you could have switched steps 4 and 5 with step 3 (thus effectively matching the AWS instructions) as long as you deselect the driver install as part of the toolkit install process.

Again, this doesn’t describe how to perform a similar operation using the package manager method. That exercise is left to the reader, given the above understanding of the issue.
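Since the toolkit install can silently replace the driver, it may help to check which driver version is installed without calling nvidia-smi at all. A minimal sketch, assuming GNU sort (for -V) and that the kernel module is named nvidia; version_gt, installed_driver_version and check_driver are hypothetical helper names. A version newer than 352.39 does not by itself guarantee a working setup; the thread only reports 361.45 and 367.27 as suitable.

```shell
#!/usr/bin/env bash
BUNDLED_DRIVER="352.39"   # driver bundled with the CUDA 7.5 runfile, per the thread

# Succeed if version $1 is strictly newer than version $2 (GNU sort -V).
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Version of the nvidia kernel module installed for the running kernel.
installed_driver_version() {
    modinfo -F version nvidia 2>/dev/null
}

# Compare a driver version (argument, or the installed one) to the bundled one.
check_driver() {
    local v="${1:-$(installed_driver_version)}"
    if [ -z "$v" ]; then
        echo "no nvidia kernel module found"
        return 2
    elif version_gt "$v" "$BUNDLED_DRIVER"; then
        echo "driver $v is newer than $BUNDLED_DRIVER"
    else
        echo "driver $v is not newer than $BUNDLED_DRIVER"
        return 1
    fi
}
```

Running check_driver before and after the toolkit install would show whether the toolkit step downgraded the driver.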
