I installed CUDA 7.5 on a fresh EC2 g2.2xlarge running Ubuntu 14.04 (kernel 3.13.0-63-generic). Unfortunately the samples won't run reliably. Sometimes I can run any sample any number of times without trouble; other times, after a reboot, a sample runs a few times and then hangs. I have seen this with deviceQuery as well. Once an application hangs, it cannot be interrupted or killed, and the load average climbs with ksoftirqd dominating the CPU.
I would welcome suggestions on how to triage this.
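For reference, this is the sort of information I can collect when it hangs (all standard Linux tools, nothing GPU-specific):
dmesg | tail -n 50                               # kernel messages from around the time of the hang
ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'   # processes stuck in uninterruptible sleep (the un-killable state)
top -b -n 1 | head -n 20                         # confirms ksoftirqd is what is driving the load average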
Thanks for responding. Following your suggestion I wrote "options nvidia NVreg_EnableMSI=0" into /etc/modprobe.d/nvidia.conf. After a reboot, things looked promising as BlackScholes ran several hundred times without a problem. Sadly, after rebooting again, BlackScholes hung on the second run.
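For anyone wanting to try the same thing, the steps come down to something like this (the sample path and loop count are just placeholders):
echo "options nvidia NVreg_EnableMSI=0" | sudo tee /etc/modprobe.d/nvidia.conf
sudo update-initramfs -u        # in case the nvidia module gets loaded from the initramfs
sudo reboot
# after the reboot, stress a sample in a loop until it hangs or completes
for i in $(seq 1 500); do ./BlackScholes || break; done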
Exactly the same issue here. Fresh Ubuntu 14.04 EC2 AMI on g2.2xlarge, kernel 3.13.0-65-generic.
Following a reboot there are no device nodes under /dev/nvidia* (which is normal). The first run of nvidia-smi gives the Killed message as above, but it does create the two device nodes /dev/nvidia0 and /dev/nvidiactl. When I run nvidia-smi again (or any CUDA-enabled software) it just hangs forever.
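A quick way to see the same symptoms from the shell (all standard commands, nothing specific to this driver version):
ls -l /dev/nvidia*              # absent right after boot; created on first driver use
lsmod | grep nvidia             # is the kernel module actually loaded?
dmesg | grep -i nvidia | tail   # any NVRM errors logged when nvidia-smi was killed?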
“Fixed” this by uninstalling CUDA 7.5 and installing CUDA 7.0, which worked straight away.
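In case it helps anyone, the downgrade is straightforward with the .run installers; the 7.0 runfile name below is the one the archive page offered at the time and may differ for you, so grab the link from the CUDA archive itself:
# remove 7.5 first (apt-get --purge remove cuda for a .deb install, or the uninstall
# script the 7.5 runfile drops under /usr/local/cuda-7.5/bin), then:
sudo sh cuda_7.0.28_linux.run            # say yes to the bundled 346.xx driver when prompted
/usr/local/cuda-7.0/bin/nvcc --version   # should now report release 7.0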
Yes, someone at NVIDIA is looking at it. I will respond again once we have something to report; until then I will not respond to additional requests for information, as I will have nothing new to add.
We are aware of the issue, we can reproduce it, and someone is looking at it. I don't have further details. If you want personalized communication with NVIDIA on this issue, I suggest filing a bug through the developer.nvidia.com portal. Either way, the issue is being looked at.
As a workaround, CUDA 7 works correctly on these instances (assuming you use the driver version that shipped with CUDA 7, e.g. a 346.xx driver).
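To confirm the pairing on a given instance, both of these are quick to check:
cat /proc/driver/nvidia/version   # kernel driver build; a 346.xx line matches the CUDA 7 era
nvcc --version                    # toolkit release reported by the compiler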
For those having trouble with this issue on Linux g2.2xlarge or g2.8xlarge instances using CUDA 7.5, I suggest trying the 352.63 driver that was just posted:
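The usual runfile install applies; the URL below follows NVIDIA's standard download pattern for that version, but it is worth copying the link from the driver page itself:
# make sure nothing is using the GPU (no X server runs on a headless EC2 instance anyway), then:
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/352.63/NVIDIA-Linux-x86_64-352.63.run
sudo sh NVIDIA-Linux-x86_64-352.63.run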
I am on a g2.2xlarge instance using CUDA 7.5 with the 352.63 driver. I am still seeing
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
modprobe: ERROR: could not insert 'nvidia_352': Unknown symbol in module, or unknown parameter (see dmesg)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
when running deviceQuery. Is anyone able to make it work?
What you are seeing is normal, because the AWS Ubuntu kernel is a minimal build that does not support everything out of the box. I struggled with the same issue and then found out I needed to install the linux-generic kernel. Here is what I do, if it helps:
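(The package names and the driver runfile below are a sketch for a stock 14.04 AMI rather than an exact transcript; adjust them to your setup.)
sudo apt-get update
sudo apt-get install -y linux-generic        # generic kernel image, headers and full module set
sudo reboot                                  # boot into the newly installed kernel
uname -r                                     # check the running kernel matches the headers just installed
sudo sh NVIDIA-Linux-x86_64-352.63.run       # reinstall the driver so its module is built for this kernel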
We realised that the GPU on the AWS G2 instance is a GRID-class card and installed the drivers specifically for the GRID K520 (Official Drivers | NVIDIA); since then we have had a stable CUDA configuration. All the other methods failed for us on the G2 instance.
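Once the right driver is in place, it is easy to confirm the card and the binding:
lspci | grep -i nvidia   # should list the GRID K520
nvidia-smi -L            # enumerates the GPU once the driver is loaded and working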