I have a GPU VM instance created in the Amazon AWS EC2 cloud. I followed these instructions for installing the latest Nvidia driver (the rough install commands are sketched after the selection below):
Using this selection for Amazon G2:
Product Type: GRID
Product Series: GRID Series
Product: GRID K520
Operating System: Linux 64-bit
Recommended/Beta: Recommended/Certified
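For reference, the driver install itself was just the standard .run installer from that download page, roughly like this (the version number is a placeholder; I don’t recall the exact filename):

$ chmod +x NVIDIA-Linux-x86_64-<version>.run
$ sudo ./NVIDIA-Linux-x86_64-<version>.run   # accepted the defaults and let it build the kernel module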
Then I installed CUDA 7.5:
CUDA Toolkit 7.5 Downloads | NVIDIA Developer
Linux x86_64 CentOS 7 rpm(local)
cuda-repo-rhel7-7-5-local-7.5-18.x86_64.rpm
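For completeness, the install itself followed the usual local-repo steps (roughly, from memory):

$ sudo rpm -i cuda-repo-rhel7-7-5-local-7.5-18.x86_64.rpm   # register the local repo
$ sudo yum clean all
$ sudo yum install cuda                                     # install the toolkit from it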
No errors.
After installation, I ran:
$ nvidia-smi -q
This hung the VM immediately. After a hard reset via the Amazon Management console, the VM came back up. I tried again, and it would still hang.
I uninstalled the Nvidia driver and CUDA, and decided to try again.
This time, after installing the Nvidia driver (but before installing CUDA), I ran the ‘nvidia-smi -q’ command. No problem: it returned results quickly and the VM did not hang.
I then installed CUDA 7.0, thinking maybe 7.5 was the problem. After installing CUDA 7.0, I ran the ‘nvidia-smi -q’ command. This hung the VM.
So, it would appear there is some problem with the CUDA installation.
I read a post somewhere about “GPU persistence mode”, and tried:
nvidia-smi -pm 1
After setting this, the ‘nvidia-smi -q’ command no longer hung the VM, but the command itself never returned any output and I couldn’t kill the process. I could ssh into the VM from another terminal without any issue and look for errors, but didn’t see any.
I’m not sure if this ‘fix’ is relevant; there still seems to be a problem since the command doesn’t return any output.
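In case it helps, this is roughly what I looked at from the second ssh session while ‘nvidia-smi -q’ was stuck; nothing obviously related turned up:

$ ps aux | grep nvidia-smi          # the stuck process is there; kill -9 on its PID has no effect
$ dmesg | tail -n 50                # no driver errors that I could see
$ sudo tail -n 50 /var/log/messages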
Is there anything else I can try, or anywhere else I should look for the problem?
Any help/tips would be appreciated.
Thanks!