Hi, everyone.
I have a server with four Tesla C2050 GPUs, and I used to run CUDA programs on it, so there is nothing wrong with the hardware.
Recently I had to reinstall the OS, and of course the CUDA environment had to be reinstalled too.
The problem is that after reinstalling the OS, the driver, the toolkit and the SDK, I get errors. :(
The error is:
“NVIDIA:could not open the device file /dev/nvidia0(Input/Output error)
There is no device supporting CUDA”
What is wrong? I am quite frustrated, because I have tried a lot of methods and all of them failed.
Can anybody help?
The solution can be found in both the Linux toolkit release notes and the Linux getting started PDF. If you are running “headless”, you will need to install a boot script which will manually create the necessary /dev file system entries for the driver to work correctly.
Thanks a lot.
After reading the release notes again, I think I see where the problem is: I didn’t create a script to load the kernel module and create the entries. But is the script really required? I already have four files in /dev: nvidia0, nvidia1, nvidia2 and nvidia3. Aren’t these files the entries?
Thanks again :)
Most (all?) Linux distributions nowadays repopulate /dev with only the devices actually present. So even if you can see the device files right now, you still need the boot script to make sure they are still there after the next reboot.
Thank you.
You seem to be very familiar with the devices, so I have some questions. I can see the device files, but they are 0 bytes in size. Is that normal?
I know nothing about /dev. What is a boot script? Where can I find it, and how do I add my code to make sure the devices are available?
:)
Yes, that is normal. They are not real files with data in them; rather, they are placeholders that carry the access rights to the devices and allow them to be used as if they were files.
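If you want to see this for yourself, here is a rough Python sketch (my own illustration, not something from the release notes) that reports what kind of entry a path in /dev is. The function name describe_node and the list of paths are just assumptions for the example. It only uses os.stat, so it reads the node's metadata without opening the device.

```python
import os
import stat


def describe_node(path):
    """Summarise a /dev entry: type, device numbers, size and mode string."""
    st = os.stat(path)  # reads metadata only; the device itself is not opened
    is_char = stat.S_ISCHR(st.st_mode)
    return {
        "path": path,
        "is_char_device": is_char,
        # Major/minor numbers only mean something for device nodes.
        "major": os.major(st.st_rdev) if is_char else None,
        "minor": os.minor(st.st_rdev) if is_char else None,
        "size": st.st_size,
        "mode": stat.filemode(st.st_mode),  # same format as the first column of ls -l
    }


if __name__ == "__main__":
    # Illustrative paths; /dev/null is included as a device node that always exists.
    for p in ("/dev/null", "/dev/nvidia0", "/dev/nvidiactl"):
        try:
            print(describe_node(p))
        except OSError as exc:
            print(f"{p}: {exc}")
```

Keep in mind that this only tells you the node exists and what its type and permissions are. It says nothing about whether the kernel module behind it is loaded and working.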
I see :)
But can I tell from the file whether the device is available or not? The file exists, it is owned by root and it has rw permission. Does that mean the device is available?
It is available to root.
To make it available to other users, change the permission to:
crw-rw-rw- 1 root root 195, 0 May 16 16:31 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 May 16 16:31 /dev/nvidiactl
You could use the script in the release notes to create the entries with the proper permissions.
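To give an idea of what such a script has to do, here is a minimal Python sketch. It is not the script from the release notes (that one is a shell script, and it is the one to use); it only illustrates the same idea using the numbers from the listing above: major 195, minor 0..N-1 for the GPUs, minor 255 for nvidiactl, mode rw for everyone. The names plan_nodes and create_nodes, the --apply switch and the hard-coded count of 4 are my own assumptions. It does not load the kernel module, which has to happen first.

```python
import os
import stat
import sys

NVIDIA_MAJOR = 195  # major number shown in the listing above
CTL_MINOR = 255     # minor number of /dev/nvidiactl in the listing above
MODE = 0o666        # crw-rw-rw-


def plan_nodes(gpu_count):
    """Return (path, major, minor) for each node: one per GPU plus the control node."""
    nodes = [(f"/dev/nvidia{i}", NVIDIA_MAJOR, i) for i in range(gpu_count)]
    nodes.append(("/dev/nvidiactl", NVIDIA_MAJOR, CTL_MINOR))
    return nodes


def create_nodes(nodes, dry_run=True):
    """Create the character devices, or just print equivalent mknod commands."""
    for path, major, minor in nodes:
        if dry_run:
            print(f"mknod -m 666 {path} c {major} {minor}")
            continue
        if not os.path.exists(path):
            os.mknod(path, stat.S_IFCHR | MODE, os.makedev(major, minor))
        # mknod is subject to the umask, so set the mode explicitly.
        os.chmod(path, MODE)


if __name__ == "__main__":
    # 4 matches the four cards in this thread (an assumption; adjust as needed).
    # Without --apply nothing is changed; the commands are only printed.
    create_nodes(plan_nodes(4), dry_run="--apply" not in sys.argv)
```

Run without arguments, it only prints the mknod commands it would issue. Creating the nodes needs root and the --apply switch. To have it run at every boot it has to be hooked into your distribution's startup mechanism (for example /etc/rc.local where that exists), which depends on the distribution.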
I did log in as root, but it still failed. I will check the permissions later because I am at home now.
But do you know how to create and run the script?
Thank you.