CUDA 7.5 unstable on EC2?

lindsay · September 24, 2015, 5:36am

I installed CUDA 7.5 on a fresh EC2 g2.2xlarge running Ubuntu 14.04 (kernel 3.13.0-63-generic). Unfortunately the samples won’t run reliably. Sometimes I can run any sample any number of times. Other times, after a reboot, a sample may run a few times and then hang. This has been observed with deviceQuery as well. Once the application hangs, it can’t be interrupted or killed and the load average will climb with ksoftirqd dominating the CPU.

I would welcome suggestions on how to triage this.

Robert_Crovella · September 24, 2015, 2:55pm

Just a complete guess here, but you might try modifying the IRQ handling to see if it makes a difference:

http://us.download.nvidia.com/XFree86/Linux-x86_64/352.41/README/knownissues.html

lindsay · September 24, 2015, 5:23pm

Thanks for responding. Following your suggestion I wrote “options nvidia NVreg_EnableMSI=0” into /etc/modprobe.d/nvidia.conf. After a reboot, things looked promising as BlackScholes ran several hundred times without a problem. Sadly, after rebooting again, BlackScholes hung on the second run.
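
For anyone repeating this experiment, here is a minimal sketch of adding that option line without duplicating it on repeated runs. The function name set_nvidia_msi_off is made up for illustration; the option text and the default path are the ones quoted above. Writing under /etc/modprobe.d needs root, and the option is only read when the nvidia module is next loaded, hence the reboot.

```shell
#!/bin/sh
# Hypothetical helper: append the MSI-disabling option to a modprobe
# config file, but only if that exact line is not already present.
set_nvidia_msi_off() {
    # Default path is the one used in the post; pass another path to try it safely.
    conf="${1:-/etc/modprobe.d/nvidia.conf}"
    line="options nvidia NVreg_EnableMSI=0"
    # grep -qxF: quiet, whole-line, fixed-string match.
    if ! grep -qxF "$line" "$conf" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf"
    fi
}

# Usage (as root): set_nvidia_msi_off
```

As the post notes, this setting did not cure the hang, so treat it as one experiment rather than a fix.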

leptogenesis · September 28, 2015, 4:25am

Having the same problem. For me, just installing it and running nvidia-smi results in:

+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
Killed

And then nvidia-smi hangs on the next attempt to run.

adslcx · October 3, 2015, 1:13am

I’m seeing this exact problem on a fresh Ubuntu instance. Does anyone have any idea?

mun · October 4, 2015, 4:50pm

Exactly the same issue here. Fresh Ubuntu 14.04 EC2 AMI on g2.2xlarge, kernel 3.13.0-65-generic.

Following a reboot there are no devices under /dev/nvidia* (normal). When I run nvidia-smi I get the Killed message as above, and the 2 devices /dev/nvidia0 and /dev/nvidiactl appear. When I run nvidia-smi again (or any CUDA-enabled software) it just hangs forever.

“Fixed” this by uninstalling CUDA 7.5 and installing CUDA 7.0, which worked straight away.
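
The device-node observation above can be checked with a small sketch like the following. The function name count_nvidia_nodes is hypothetical; it only counts entries whose names start with "nvidia" in a directory (default /dev) and does not touch the driver, so it should be safe to run even when nvidia-smi hangs.

```shell
#!/bin/sh
# Hypothetical helper: count entries named nvidia* in a directory.
count_nvidia_nodes() {
    dir="${1:-/dev}"
    n=0
    for f in "$dir"/nvidia*; do
        # If nothing matches, the pattern stays literal and -e fails, so n stays 0.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Usage: count_nvidia_nodes        # counts /dev/nvidia*
```

Going by the description above, one would expect 0 right after a reboot and 2 after the first nvidia-smi run on a single-GPU g2.2xlarge.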

mtk · October 8, 2015, 9:37pm

Any news about this? Is someone in NVIDIA looking at it?

Robert_Crovella · October 8, 2015, 10:11pm

Yes, someone at NVIDIA is looking at it. I will respond once again when we have anything to report. Until we have something to report, I will not respond to additional requests for information – as I will have no new information to report.

We are aware of the issue, we can reproduce it, and someone is looking at it. I don’t have further details. If you desire personalized communication with NVIDIA on this issue, I suggest filing a bug at the developer.nvidia.com portal. However the issue is being looked at.

As a workaround, CUDA 7 works correctly on these instances (assuming you use the driver version contemporary with CUDA 7, e.g. a 346.xx driver).

mtk · October 9, 2015, 7:20am

Thanks!

Robert_Crovella · November 17, 2015, 2:47pm

For those having trouble with this issue on Linux g2.2xlarge or g2.8xlarge instances using CUDA 7.5, I suggest trying the 352.63 driver that was just posted:

http://www.nvidia.com/object/unix.html

http://www.nvidia.com/Download/driverResults.aspx/95159/en-us

JasonJuang · November 27, 2015, 8:03am

I am on a g2.2xlarge instance using CUDA 7.5 with the 352.63 driver. Still seeing

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

modprobe: ERROR: could not insert 'nvidia_352': Unknown symbol in module, or unknown parameter (see dmesg)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

when running deviceQuery. Is anyone able to make it work?
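
The modprobe error points at dmesg. A small sketch for pulling the missing symbol names out of that log is below. It assumes the kernel's usual "Unknown symbol NAME" message wording, and the function name missing_symbols is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print each
# distinct symbol name that appears in an "Unknown symbol NAME" message.
missing_symbols() {
    sed -n 's/.*Unknown symbol \([A-Za-z0-9_]*\).*/\1/p' | sort -u
}

# Usage: dmesg | missing_symbols
```

The names printed show which kernel symbols the nvidia module could not resolve, which helps tell whether the running kernel is missing modules the driver depends on.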

Martin_Peniak · January 9, 2016, 8:39pm

Jason,

What you are seeing is normal because the AWS Ubuntu kernel is basically a minimal version that does not support everything out of the box. I struggled with the same issue and then found out I needed to install a linux-generic kernel. Here is what I do if it helps:

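# Run as root. Suppress interactive prompts during package installs.
export DEBIAN_FRONTEND=noninteractive
apt-get update -q -y
# Install the full generic kernel, keeping existing config files.
apt-get -q -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install linux-generic
# Install CUDA 7.5 from NVIDIA's local repository package.
wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
apt-get update -q -y
apt-get install cuda -q -y
# Add the CUDA library directories to the dynamic linker search path.
echo '/usr/local/cuda/lib64
/usr/local/cuda/lib' | tee -a /etc/ld.so.conf.d/cuda.conf > /dev/null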

jbold · March 9, 2016, 9:44pm

Thank you @Martin! I was experiencing the same exact issue and your answer was the perfect fix.

ankitsingh · June 7, 2016, 9:03am

We realised that the GPU on the AWS G2 instance is a GRID-type card and installed the drivers specifically for the GRID K520 from NVIDIA's official drivers page. Since then, we have had a stable CUDA configuration. All the other methods failed for us on the G2 instance.

Sylvain_Fetiveau · June 9, 2016, 6:43pm

Replacing version 352.39 of the NVIDIA driver that came with CUDA 7.5 with version 361.45 (downloaded separately) solved the problem for me:

first download the installer from http://www.nvidia.com/download/driverResults.aspx/103306/en-us

then unload the old driver, and install/build/load the new one:

sudo modprobe -r nvidia
sudo ./NVIDIA-Linux-x86_64-361.45.11.run
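
To confirm which driver ended up loaded, the version can be read from the nvidia-smi banner shown earlier in the thread. The sketch below is an illustration only. The function names are hypothetical, version_at_least relies on GNU sort -V, and the 352.63 threshold in the usage line is just the release suggested above, not a confirmed minimum.

```shell
#!/bin/sh
# Hypothetical helper: print the driver version found in nvidia-smi
# banner text read from stdin (the "Driver Version: X" field).
smi_driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: succeed if version $1 is at least version $2.
# Uses GNU sort -V (version sort) to order the two strings.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage:
#   v=$(nvidia-smi | smi_driver_version)
#   version_at_least "$v" 352.63 && echo "driver $v is 352.63 or newer"
```

Since nvidia-smi itself hangs on an affected setup, this is only useful after the driver has been replaced.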

Robert_Crovella · July 1, 2016, 3:40pm

Some additional discussion is here:

https://devtalk.nvidia.com/default/topic/942116/cuda-setup-and-installation/system-hangs-when-issuing-quot-nvidia-smi-q-quot-command-after-installing-cuda

merely a confirmation of the info in this thread.
