Improve diagnostics when receiving startup error on Linux NVIDIA driver after NVIDIA driver update

Please provide the following info (tick the boxes after creating this topic):

Submission Type
Bug or Error
Feature Request
Documentation Issue
Question
Other

Workbench Version
Desktop App v0.44.8
CLI v0.21.3
Other

Host Machine operating system and location
Local Windows 11
Local Windows 10
Local macOS
Local Ubuntu 22.04
Remote Ubuntu 22.04
Other

Connecting to my remote Linux server from my Windows machine fails a lot of the time. It is usually a driver error. AI Workbench should log the expected and found driver version numbers when launch fails.

My Linux server runs updates regularly, and this almost always breaks the AI Workbench remote connection. Workbench returns an unhelpful message, and the logs do not include all the details that are available. The log should include the expected and found driver version numbers, but it does not.
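As a sketch of the kind of logging this request asks for (hypothetical, not actual AI Workbench code), both version numbers are already parseable from the outputs quoted in this post:

```shell
# Hypothetical sketch (not actual AI Workbench code): when the GPU check
# fails, log both driver versions instead of just "exit status 18".
lib_ver=$(nvidia-smi 2>&1 | sed -n 's/.*NVML library version: *\([0-9.]*\).*/\1/p')
kmod_ver=$(sed -n 's/.*Kernel Module *\([0-9.]*\).*/\1/p' /proc/driver/nvidia/version 2>/dev/null)
echo "NVML library version: ${lib_ver:-unknown}"
echo "loaded kernel module version: ${kmod_ver:-unknown}"
```

Even just these two lines in the log would have made the failure obvious.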

This is the error provided in the AI Workbench logs; it is almost always wrong and is no help when troubleshooting:

{"level":"error","error":"Process exited with status 1","cmd":"/home/joe/.nvwb/bin/wb-svc -quiet start-container-tool","stderr":"an error occurred while checking host state. try again: Failed to check gpu state - you may need to restart you computer to reload drivers: error when calling bash command: : exit status 18\n\n","time":"2024-06-23T12:39:51-04:00","message":"SSHCmd.Run failed."}
{"level":"error","error":"an error occurred while checking host state. try again: Failed to check gpu state - you may need to restart you computer to reload drivers: error when calling bash command: : exit status 18\n\n: Process exited with status 1","isWrapped":false,"isInteractive":false,"engine":"json","detail":"detail","time":"2024-06-23T12:39:51-04:00","message":"an error occured while executing '/home/joe/.nvwb/bin/wb-svc -quiet start-container-tool' on '192.168.1.154'"}

Running nvidia-smi --quiet --query-gpu returns the following, which is somewhat more useful:

(base) joe@hp-z820:~$ nvidia-smi --quiet --query-gpu
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 550.67

Running .nvwb/bin/wb-svc start-container-tool returns more information:

12:41PM INF starting host introspection
12:41PM ERR error when calling bash command error="exit status 18" command="nvidia-smi --query-gpu=index,uuid,gpu_name,driver_version --format=csv,noheader" stderr= stdout="Failed to initialize NVML: Driver/library version mismatch\nNVML library version: 550.67\n"
12:41PM ERR an error occurred while checking host state. try again error="Failed to check gpu state - you may need to restart you computer to reload drivers: error when calling bash command: : exit status 18"
an error occurred while checking host state. try again: Failed to check gpu state - you may need to restart you computer to reload drivers: error when calling bash command: : exit status 18

Running cat /proc/driver/nvidia/version returned:

NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.67  Tue Mar 12 23:54:15 UTC 2024
GCC version:  gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
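For anyone else hitting this, comparing the loaded kernel module against the module installed on disk shows whether a package update has outpaced the running driver (a sketch; after an apt driver update the two diverge until a reboot or module reload, which is what produces the mismatch):

```shell
# Compare the loaded NVIDIA kernel module with the one installed on disk.
# After a driver package update these diverge until the module is reloaded.
loaded=$(cat /sys/module/nvidia/version 2>/dev/null || echo "not loaded")
ondisk=$(modinfo -F version nvidia 2>/dev/null || echo "not installed")
echo "loaded module:  $loaded"
echo "on-disk module: $ondisk"
```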

I have no idea how to troubleshoot this. The last time I had this issue I had to:

  1. Uninstall the nvidia driver
  2. Uninstall AI Workbench
  3. Reinstall AI Workbench, which would install the driver it needed

A restart fixed it this time, so the problem was different from the previous time I saw the same messages.
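For what it's worth, when a reboot is inconvenient, a mismatch like this can sometimes be cleared by reloading the kernel modules instead (a sketch; it assumes nothing such as Xorg or a running container still holds the module):

```shell
# Reload the NVIDIA kernel modules so the freshly updated user-space
# libraries and the loaded module agree again (assumes the module is idle).
if lsmod | grep -q '^nvidia '; then
    for m in nvidia_uvm nvidia_drm nvidia_modeset nvidia; do
        sudo rmmod "$m" 2>/dev/null || true   # some modules may not be loaded
    done
    sudo modprobe nvidia
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
else
    echo "nvidia module not loaded; nothing to reload"
fi
```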
