CUDA 7.5 unstable on EC2?

I installed CUDA 7.5 on a fresh EC2 g2.2xlarge running Ubuntu 14.04 (kernel 3.13.0-63-generic). Unfortunately the samples won’t run reliably. Sometimes I can run any sample any number of times. Other times, after a reboot, a sample may run a few times and then hang. This has been observed with deviceQuery as well. Once the application hangs, it can’t be interrupted or killed and the load average will climb with ksoftirqd dominating the CPU.

I would welcome suggestions on how to triage this.

Just a complete guess here, you might try modifying the irq handling to see if it makes a difference:

http://us.download.nvidia.com/XFree86/Linux-x86_64/352.41/README/knownissues.html

Thanks for responding. Following your suggestion I wrote “options nvidia NVreg_EnableMSI=0” into /etc/modprobe.d/nvidia.conf. After a reboot, things looked promising as BlackScholes ran several hundred times without a problem. Sadly, after rebooting again, BlackScholes hung on the second run.
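
For anyone repeating this experiment, here is a minimal sketch of writing that option so it is not duplicated on repeated runs. The function name set_nvidia_msi_off is hypothetical; the file path and option string are the ones quoted above. It has to run as root when targeting /etc/modprobe.d, and the module only picks the option up the next time it loads.

```shell
#!/bin/sh
# Append the MSI-disabling option to a modprobe config file, but only once.
# set_nvidia_msi_off is a hypothetical helper name, not part of any NVIDIA tool.
set_nvidia_msi_off() {
    conf="${1:-/etc/modprobe.d/nvidia.conf}"
    line="options nvidia NVreg_EnableMSI=0"
    # -x matches the whole line, -F treats the pattern as a fixed string.
    if ! grep -qxF "$line" "$conf" 2>/dev/null; then
        printf '%s\n' "$line" >> "$conf"
    fi
}

# Usage (as root), followed by a reboot or a module reload:
#   set_nvidia_msi_off
```

As the post above reports, this setting did not cure the hang, so treat it as a diagnostic step rather than a fix.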

Having the same problem. For me, just installing it and running nvidia-smi results in:

+------------------------------------------------------+                       
| NVIDIA-SMI 352.39     Driver Version: 352.39         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
Killed

And then nvidia-smi hangs on the next attempt to run.

I’m seeing this exact problem on a fresh Ubuntu instance. Does anyone have any idea?

Exactly the same issue here. Fresh Ubuntu 14.04 EC2 AMI on g2.2xlarge, kernel 3.13.0-65-generic.

Following a reboot there are no devices under /dev/nvidia* (normal). When I run nvidia-smi I get the Killed message as above, and the two devices /dev/nvidia0 and /dev/nvidiactl appear. When I run nvidia-smi again (or any CUDA-enabled software) it just hangs forever.
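
A process that hangs and cannot be killed looks like one stuck in uninterruptible sleep (state D in ps); that reading is an assumption, not something confirmed in this thread. A small sketch for spotting such processes from another shell, with d_state_procs as a hypothetical helper name:

```shell
#!/bin/sh
# Print "PID COMMAND" for processes in uninterruptible sleep (state D).
# Expects the output of `ps -eo pid,stat,comm` on stdin; the header line is skipped.
# Only the first word of the command name is printed.
d_state_procs() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Usage on a live system:
#   ps -eo pid,stat,comm | d_state_procs
#   ls -l /dev/nvidia*      # shows which device nodes exist at that moment
```

If nvidia-smi or a CUDA sample shows up in that list, it would at least confirm where the hang is sitting before a reboot.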

“Fixed” this by uninstalling CUDA 7.5 and installing CUDA 7.0, which worked straight away.

Any news about this? Is someone in NVIDIA looking at it?

Yes, someone at NVIDIA is looking at it. I will respond once again when we have anything to report. Until we have something to report, I will not respond to additional requests for information – as I will have no new information to report.

We are aware of the issue, we can reproduce it, and someone is looking at it. I don’t have further details. If you desire personalized communication with NVIDIA on this issue, I suggest filing a bug at the developer.nvidia.com portal. However the issue is being looked at.

As a workaround, CUDA 7 works correctly on these instances (assuming you use the driver version circa CUDA 7 - e.g. 346.xx driver)

Thanks!

For those having trouble with this issue on Linux g2.2xlarge or g2.8xlarge instances using CUDA 7.5, I suggest trying the 352.63 driver that was just posted:

http://www.nvidia.com/object/unix.html

http://www.nvidia.com/Download/driverResults.aspx/95159/en-us
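
To check whether an installed driver is at least 352.63, a version comparison along these lines may help. This is a sketch: version_ge is a hypothetical helper, it relies on GNU sort -V, and the way the installed version is obtained in the usage comment assumes the kernel module is named nvidia (packaged drivers may use a different module name).

```shell
#!/bin/sh
# Succeed if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort): the smaller version sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage, assuming the module is called "nvidia":
#   installed="$(modinfo -F version nvidia)"
#   if version_ge "$installed" 352.63; then echo "driver is new enough"; fi
```

Reading the version from modinfo avoids calling nvidia-smi, which is the very command that hangs for several people in this thread.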

I am on a g2.2xlarge instance using CUDA 7.5 with the 352.63 driver. Still seeing

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

modprobe: ERROR: could not insert 'nvidia_352': Unknown symbol in module, or unknown parameter (see dmesg)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL

when running deviceQuery. Is anyone able to make it work?

Jason,

What you are seeing is normal because the AWS Ubuntu kernel is basically a minimal version that does not support everything out of the box. I struggled with the same issue and then found out I needed to install a linux-generic kernel. Here is what I do if it helps:

export DEBIAN_FRONTEND=noninteractive
apt-get update -q -y
apt-get -q -y -o Dpkg::Options::="--force-confdef" -o Dpkg::Options::="--force-confold" install linux-generic
wget http://developer.download.nvidia.com/compute/cuda/7.5/Prod/local_installers/cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
apt-get update -q -y
apt-get install cuda -q -y
echo '/usr/local/cuda/lib64
/usr/local/cuda/lib' | tee -a /etc/ld.so.conf.d/cuda.conf > /dev/null
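
Since the modprobe error says "see dmesg", a quick filter like the following can show whether the nvidia module is still failing on unresolved symbols. It is only a sketch: nvidia_symbol_errors is a hypothetical helper name, and it assumes the relevant kernel log lines mention both "nvidia" and "Unknown symbol". Presumably the machine also has to be rebooted into the newly installed kernel before the result changes.

```shell
#!/bin/sh
# Print kernel log lines where an nvidia module reported an unresolved symbol.
# Reads log text on stdin; exits non-zero when nothing matches.
nvidia_symbol_errors() {
    grep -i 'nvidia' | grep -i 'unknown symbol'
}

# Usage on a live system:
#   uname -r                        # which kernel is actually running
#   dmesg | nvidia_symbol_errors    # no output means no such errors were logged
```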

Thank you @Martin! I was experiencing the same exact issue and your answer was the perfect fix.

We realised that the GPU on the AWS G2 instance is a GRID type, so we installed the drivers specifically for the GRID K520 (http://www.nvidia.com/Download/index.aspx?lang=en-us). Since then we have had a stable CUDA configuration. All the other methods failed for us on the G2 instance.

Replacing version 352.39 of the NVIDIA driver that came with CUDA 7.5 with version 361.45 (downloaded separately) solved the problem for me:

First, download the installer from http://www.nvidia.com/download/driverResults.aspx/103306/en-us

Then unload the old driver and install/build/load the new one:

sudo modprobe -r nvidia
sudo ./NVIDIA-Linux-x86_64-361.45.11.run
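
Before running the installer it may be worth confirming that the old module really did unload, since modprobe -r fails when the module is still in use. A minimal sketch, with module_loaded as a hypothetical helper name; it reads the /proc/modules format, where the first column is the module name. Whether other modules hold the driver on a given system is an assumption to check, not something this thread establishes.

```shell
#!/bin/sh
# Succeed if the named module appears in /proc/modules-style input on stdin.
# The first whitespace-separated field of each line is the module name.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Usage on a live system:
#   if module_loaded nvidia < /proc/modules; then
#       echo "nvidia is still loaded; stop whatever is using it first"
#   fi
```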

Some additional discussion is here:

https://devtalk.nvidia.com/default/topic/942116/cuda-setup-and-installation/system-hangs-when-issuing-quot-nvidia-smi-q-quot-command-after-installing-cuda

It is merely a confirmation of the info in this thread.