Hello, I have a problem with nvidia-smi. When I run nvidia-smi, it reports "No devices were found".
Sometimes it reports information for only one of my GPUs.
My environment:
OS: Ubuntu 18.04 LTS Server
Kernel: 4.15.0-163-generic
GPU: 2 x RTX 3080
Driver: NVIDIA-Linux-x86_64-470.74.run
Checking your report log, I can see this message repeated:
[ 2143.889863] NVRM: GPU 0000:65:00.0: rm_init_adapter failed, device minor number 0
[ 2154.838560] NVRM: GPU 0000:65:00.0: RmInitAdapter failed! (0x23:0xffff:1204)
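To see how often each GPU fails to initialize, a rough Python sketch like the one below can count these NVRM lines per PCI address. It assumes the kernel log has been saved as text (for example, the output of dmesg). The function name count_init_failures and the regular expression are my own illustration and only cover the two message forms quoted above.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
# [ 2143.889863] NVRM: GPU 0000:65:00.0: rm_init_adapter failed, device minor number 0
# [ 2154.838560] NVRM: GPU 0000:65:00.0: RmInitAdapter failed! (0x23:0xffff:1204)
NVRM_FAILURE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: GPU (?P<bus>[0-9a-fA-F:.]+): "
    r"(?P<msg>rm_init_adapter failed|RmInitAdapter failed)"
)


def count_init_failures(log_text):
    """Return a Counter mapping PCI address -> number of init-failure lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NVRM_FAILURE.search(line)
        if match:
            counts[match.group("bus")] += 1
    return counts
```

Feeding it the saved log text shows whether the failures are confined to one card (here 0000:65:00.0) or spread across both.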
Looking through the Linux forums (where I have also moved this topic), I can see a lot of threads showing different possible causes for these initialization failures.
One possible solution is of course to update to the latest driver 470.86, following the installation instructions very closely.
Since this only happens after about 40 minutes and the GPUs are installed in a headless server, please start by setting up nvidia-persistenced to start on boot, and make sure it stays running.
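One way to confirm that persistence mode is actually active on every card is to ask nvidia-smi for the persistence_mode field. The sketch below is only an illustration: it assumes nvidia-smi is on the PATH, and the helper names parse_persistence and query_persistence are hypothetical.

```python
import subprocess


def parse_persistence(csv_text):
    """Parse lines like '0, Enabled' into {gpu_index: True/False}."""
    modes = {}
    for line in csv_text.strip().splitlines():
        index, mode = [field.strip() for field in line.split(",")]
        modes[int(index)] = (mode == "Enabled")
    return modes


def query_persistence():
    """Run nvidia-smi and report persistence mode per GPU (needs a working driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,persistence_mode", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_persistence(result.stdout)
```

If query_persistence() returns False for any index, or lists fewer GPUs than are installed, the daemon is probably not running or a card has already dropped out.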
If the issue still occurs, I suspect an airflow problem that is causing the GPUs to overheat. Please monitor temperatures.
Since these are consumer-type cards, they are likely blocking each other's airflow and heating each other up if they sit in neighbouring slots. Please check.
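For the temperature monitoring, a small Python sketch along these lines could log readings until the failure occurs. It assumes nvidia-smi is on the PATH. The names parse_temperatures, hot_gpus and watch are hypothetical, and the 80 degree limit and 30 second interval are arbitrary placeholders rather than values from NVIDIA.

```python
import subprocess
import time


def parse_temperatures(csv_text):
    """Parse lines like '0, 71' into {gpu_index: temperature_in_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = [field.strip() for field in line.split(",")]
        try:
            temps[int(index)] = int(temp)
        except ValueError:
            # Skip cards that report a non-numeric value such as "[N/A]".
            continue
    return temps


def hot_gpus(temps, limit_c=80):
    """Return the sorted GPU indices at or above limit_c (placeholder threshold)."""
    return sorted(index for index, temp in temps.items() if temp >= limit_c)


def watch(interval_s=30):
    """Print per-GPU temperatures every interval_s seconds until interrupted."""
    while True:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,temperature.gpu",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        temps = parse_temperatures(result.stdout)
        print(time.strftime("%H:%M:%S"), temps, "hot:", hot_gpus(temps))
        time.sleep(interval_s)
```

Running watch() in a terminal and noting the last readings before a card disappears should show whether the failure lines up with rising temperatures.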