NVRM: RmInitAdapter failed!

Hello, I have two GPUs, both Tesla V100, running on Ubuntu 18.04.2 LTS with the following versions: NVIDIA-SMI 410.104, Driver Version 410.104, CUDA Version 10.0.

When I run nvidia-smi, I can no longer see both GPUs; only GPU:0 is listed.

I noticed this error when running: $ dmesg | grep NVRM

[2956988.964627] NVRM: rm_init_adapter failed for device bearing minor number 0
[2956993.146123] NVRM: RmInitAdapter failed! (0x24:0x65:1090)
[2956993.146166] NVRM: rm_init_adapter failed for device bearing minor number 0
[2956997.148541] NVRM: RmInitAdapter failed! (0x24:0x65:1090)
[2956997.148579] NVRM: rm_init_adapter failed for device bearing minor number 0
[2957001.332258] NVRM: RmInitAdapter failed! (0x24:0x65:1090)
[2957001.332295] NVRM: rm_init_adapter failed for device bearing minor number 0
[2957005.545416] NVRM: RmInitAdapter failed! (0x24:0x65:1090)
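
To see at a glance which device minor numbers are failing and how often, output like the above can be summarised with a small filter. This is only a sketch: the function name nvrm_failed_minors is made up for illustration, and it assumes the kernel messages use exactly the wording shown in the log above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the NVIDIA driver tools).
# Reads kernel log text on stdin and prints one line per device minor
# number, with the number of "rm_init_adapter failed" messages seen for it.
nvrm_failed_minors() {
    grep 'NVRM: rm_init_adapter failed for device bearing minor number' \
        | sed -E 's/.*minor number ([0-9]+).*/\1/' \
        | sort -n \
        | uniq -c \
        | awk '{ printf "minor %s: %s failed init attempts\n", $2, $1 }'
}
```

Usage would be `dmesg | nvrm_failed_minors` (dmesg may need sudo on some systems). For the excerpt above it prints `minor 0: 4 failed init attempts`.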

I don’t know what happened. I was running Docker on that GPU with the following command: sudo docker run -p 9999:8080 --runtime nvidia --env NVIDIA_VISIBLE_DEVICES="1" mltooling/ml-workspace-gpu:latest

Please make sure nvidia-persistenced is enabled to start on boot and is continuously running.
If the issue persists, please run nvidia-bug-report.sh as root and attach the resulting nvidia-bug-report.log.gz file to your post. You will have to rename the file ending to something else since the forum software doesn’t accept .gz files (nifty!)
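
For anyone unsure how to enable the daemon at boot, here is a minimal sketch. It assumes the driver package ships a systemd unit called nvidia-persistenced.service, which depends on the distribution and on how the driver was installed. The run and enable_persistenced helpers are hypothetical names used only for this example.

```shell
#!/bin/sh
# Print commands instead of running them when DRY_RUN=1 is set,
# so the steps can be reviewed before touching the system.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumption: a systemd unit named nvidia-persistenced.service exists.
# On some packages the unit is "static" (no install section), in which
# case "enable" does nothing useful and the unit is started another way.
enable_persistenced() {
    run sudo systemctl enable nvidia-persistenced
    run sudo systemctl start nvidia-persistenced
    run systemctl is-active nvidia-persistenced
}
```

Running `DRY_RUN=1 enable_persistenced` only prints the three commands; without DRY_RUN they are executed. Afterwards, `nvidia-smi -q | grep -i 'Persistence Mode'` should show whether persistence mode is actually enabled on each GPU.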

Does this report contain any personal information? The problem is on a sensitive server, so I can’t really disclose much information.

It contains at least MAC addresses and GPU serial numbers, and possibly IP addresses and user names, depending on the logging verbosity configured on the server.
It’s a text file containing Xorg logs, dmesg output, and general system info.
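
If the log has to be shared from a sensitive machine, some of those fields can be masked first. The following is a rough sketch only: redact_log is a hypothetical helper, the patterns are simple guesses (the "Serial Number : ..." line format is an assumption), and it will not catch user names, host names, or anything else not matched, so the result still needs to be read through by hand.

```shell
#!/bin/sh
# Hypothetical filter: masks MAC addresses, IPv4 addresses and
# "Serial Number : ..." values in text read from stdin.
# Any other dotted quad (for example a four-part version string)
# is masked as well; that is a known limitation of this sketch.
redact_log() {
    sed -E \
        -e 's/([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}/XX:XX:XX:XX:XX:XX/g' \
        -e 's/([0-9]{1,3}\.){3}[0-9]{1,3}/X.X.X.X/g' \
        -e 's/(Serial Number[[:space:]]*:[[:space:]]*).*/\1REDACTED/'
}
```

One way to use it: `zcat nvidia-bug-report.log.gz | redact_log > nvidia-bug-report.redacted.log`, then review the output file before attaching it.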

Enabling nvidia-persistenced and a quick restart seem to have solved the problem, thank you.

I’ll keep you updated if this happens again!

You should also make sure that no X server is installed, or at least that it is not enabled to start, since it would restart in quick succession and can also lead to the GPU becoming inaccessible over time.
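
A quick way to check whether the machine is set up to start a graphical session at boot is to look at the default systemd target. This sketch assumes a systemd-based system (Ubuntu 18.04 is one) where the X server is started by a display manager under graphical.target; boots_to_gui and check_default_target are hypothetical names for this example.

```shell
#!/bin/sh
# True (exit 0) when the given default target is the graphical one.
boots_to_gui() {
    [ "$1" = "graphical.target" ]
}

# Reports the current default target and, if it is graphical,
# prints the command that switches the default to a text console.
check_default_target() {
    target=$(systemctl get-default)
    if boots_to_gui "$target"; then
        echo "Default target is $target: a display manager and X server may start at boot."
        echo "To boot to a text console instead: sudo systemctl set-default multi-user.target"
    else
        echo "Default target is $target: normally no graphical session is started at boot."
    fi
}
```

Calling `check_default_target` only reads the setting; it does not change anything. An X server started some other way (for example from a custom service) would not be detected by this check.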