EDIT: Sorry, wrong forum! Not sure how to delete this thread so I can put it in the right place. Can a mod help me out here?
I’ve been doing some NAMD work on my university’s computation server, but have run into a very strange issue whilst trying to use CUDA.
Attempting to open anything under /dev/nvidia* (nvidia0, nvidia1, nvidiactl, etc.) causes the process to hang forever waiting for the syscall to complete. I called this a zombie at first, but strictly speaking a process stuck inside a syscall is more likely in uninterruptible sleep (state D in ps); either way, it is immune to any and all means of killing other than restarting the computer. Ctrl-Z can’t suspend the process either, so I’m forced to kill my session and reconnect.
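To tell a zombie apart from a process stuck in uninterruptible sleep, the state letter can be read from /proc from a second session. The following is only a minimal sketch, assuming a Linux /proc filesystem; the helper names process_state and describe are made up for illustration and are not part of any NVIDIA or NAMD tooling.

```python
import os

# Human-readable names for the common one-letter state codes in /proc/<pid>/stat.
STATE_NAMES = {
    "R": "running",
    "S": "interruptible sleep",
    "D": "uninterruptible sleep (usually blocked in the kernel, ignores SIGKILL)",
    "Z": "zombie (already exited, waiting to be reaped by its parent)",
    "T": "stopped",
}


def process_state(pid):
    """Return the one-letter state code of `pid` (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split after the LAST ')' before reading the state field.
    after_comm = data.rsplit(")", 1)[1]
    return after_comm.split()[0]


def describe(pid):
    """Return a short description such as 'D: uninterruptible sleep ...'."""
    state = process_state(pid)
    return f"{state}: {STATE_NAMES.get(state, 'other')}"


if __name__ == "__main__":
    # Demonstrated on this script's own PID; substitute the PID of the hung job.
    print(describe(os.getpid()))
```

If the hung job reports D rather than Z, the process is still blocked inside the kernel (most likely in the driver), which would be consistent with kill -9 having no effect.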
Restarting made the issue go away for a while, but since this is a shared server, that’s not terribly practical as a long-term solution. It crops up randomly; I first noticed it when one of my jobs abruptly deleted its output log and got stuck.
strace shows these two lines as the program hangs (the second line is cut off because the open call never returns):
stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
open("/dev/nvidiactl", O_RDWR
It seems like the files can be stat’d without issue, but any attempt to open them never completes. This happens for both read+write and read-only opens, and it occurs in any program that tries to interact with these devices (NAMD, tail, cat, etc.).
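To check whether the devices are currently affected without sacrificing a terminal, the open can be attempted in a throwaway child process with a timeout. This is a rough sketch rather than a fix: probe_open and CHILD_CODE are hypothetical names, the 5 second timeout is an arbitrary assumption, and a child that really is stuck in the driver will stay behind as an unkillable process until the machine is rebooted.

```python
import glob
import subprocess
import sys

# Tiny child program: open the path read-only, then close it again.
CHILD_CODE = "import os, sys; os.close(os.open(sys.argv[1], os.O_RDONLY))"


def probe_open(path, timeout=5.0):
    """Try to open `path` in a child process.

    Returns "ok" if the open succeeded, "error" if it failed quickly
    (missing file, permissions, ...), or "hung" if the child did not
    finish within `timeout` seconds.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        returncode = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            # Reap the child if the kill worked; give up quickly if it did not,
            # so that this parent process never blocks on an unkillable child.
            proc.wait(timeout=1.0)
        except subprocess.TimeoutExpired:
            pass
        return "hung"
    return "ok" if returncode == 0 else "error"


if __name__ == "__main__":
    for device in sorted(glob.glob("/dev/nvidia*")):
        print(device, probe_open(device))
```

subprocess.run with a timeout is deliberately avoided here, because after the timeout it waits for the child to exit, and that wait would never return if the child cannot be killed.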
The server is running Ubuntu 14.04.4 LTS.
/proc/driver/nvidia/version contains:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 352.68 Tue Dec 1 17:24:11 PST 2015
GCC version: gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.1)
nvcc -V gives:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
Commands like nvidia-smi and nvidia-debugdump lead to the violent death of my terminal, so I can’t get information from them. I can try them after another reboot, however. I’m not the system administrator, but I can have him do some queries if anything needs to be done as root.
I’ve poked around Google for a fair bit, but I haven’t been able to dredge anything up as of yet. Anybody have advice on how to proceed from here?