Please provide the following info (tick the boxes after creating this topic):
Submission Type
- [ ] Bug or Error
- [x] Feature Request
- [ ] Documentation Issue
- [ ] Question
- [ ] Other
Workbench Version
- [ ] Desktop App v0.44.8
- [ ] CLI v0.21.3
- [ ] Other
Host Machine operating system and location
- [ ] Local Windows 11
- [ ] Local Windows 10
- [ ] Local macOS
- [ ] Local Ubuntu 22.04
- [x] Remote Ubuntu 22.04
- [ ] Other
Connecting to my remote Linux server from my Windows machine breaks a lot of the time, and it usually appears to be a driver error. AI Workbench should log the expected and found driver version numbers when a launch fails.
My Linux server installs updates regularly, and this almost always breaks AI Workbench remote access. Workbench returns a useless message, and the logs do not include all the details that are available: they should include the expected and found driver version numbers, but they do not.
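To illustrate, here is a minimal sketch of the kind of check the launcher could log on failure. This is my own sketch, not Workbench code; the parsing assumes the `/proc` format shown further down in this post:

```bash
# "Found": version of the NVIDIA kernel module actually loaded right now.
sed -n 's/.*Kernel Module[[:space:]]*\([0-9.]*\).*/\1/p' /proc/driver/nvidia/version

# "Expected": version the userspace side reports; on a mismatch, nvidia-smi
# prints "NVML library version: X" on stderr.
nvidia-smi 2>&1 | grep -i 'version'
```

Logging both numbers side by side would make it obvious at a glance that an update left the loaded module and the installed libraries out of sync.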
This is the error in the AI Workbench logs. Its advice is almost always wrong and is no help when troubleshooting:
{"level":"error","error":"Process exited with status 1","cmd":"/home/joe/.nvwb/bin/wb-svc -quiet start-container-tool","stderr":"an error occurred while checking host state. try again: Failed to check gpu state - you may need to restart you computer to reload drivers: error when calling bash command: : exit status 18\n\n","time":"2024-06-23T12:39:51-04:00","message":"SSHCmd.Run failed."}
{"level":"error","error":"an error occurred while checking host state. try again: Failed to check gpu state - you may need to restart you computer to reload drivers: error when calling bash command: : exit status 18\n\n: Process exited with status 1","isWrapped":false,"isInteractive":false,"engine":"json","detail":"detail","time":"2024-06-23T12:39:51-04:00","message":"an error occured while executing '/home/joe/.nvwb/bin/wb-svc -quiet start-container-tool' on '192.168.1.154'"}
Running `nvidia-smi --quiet --query-gpu` returns the following, which is somewhat more useful:

```
(base) joe@hp-z820:~$ nvidia-smi --quiet --query-gpu
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 550.67
```
Running `.nvwb/bin/wb-svc start-container-tool` returns more information:
```
12:41PM INF starting host introspection
12:41PM ERR error when calling bash command error="exit status 18" command="nvidia-smi --query-gpu=index,uuid,gpu_name,driver_version --format=csv,noheader" stderr= stdout="Failed to initialize NVML: Driver/library version mismatch\nNVML library version: 550.67\n"
12:41PM ERR an error occurred while checking host state. try again error="Failed to check gpu state - you may need to restart you computer to reload drivers: error when calling bash command: : exit status 18"
an error occurred while checking host state. try again: Failed to check gpu state - you may need to restart you computer to reload drivers: error when calling bash command: : exit status 18
```
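Note that the wb-svc log above does contain the exact query it runs. Running that same command by hand on the server reproduces the failure directly:

```bash
# The query from the wb-svc log line above, run by hand on the server.
nvidia-smi --query-gpu=index,uuid,gpu_name,driver_version --format=csv,noheader
# Failed to initialize NVML: Driver/library version mismatch
# NVML library version: 550.67
```

So the library version (550.67) is right there on stderr; it just never makes it into the Workbench error message.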
Running `cat /proc/driver/nvidia/version` returns:

```
NVRM version: NVIDIA UNIX x86_64 Kernel Module 550.67 Tue Mar 12 23:54:15 UTC 2024
GCC version: gcc version 12.3.0 (Ubuntu 12.3.0-1ubuntu1~22.04)
```
I have no idea how to troubleshoot this (a gentler recovery is sketched after this list). The last time I had this issue I had to:
- Uninstall the NVIDIA driver
- Uninstall AI Workbench
- Reinstall AI Workbench, which installed the driver it needed
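For reference, a less drastic recovery that is commonly suggested for this mismatch is reloading the NVIDIA kernel modules so the loaded driver matches the on-disk libraries. I have not verified this on my machine, and it fails if anything is still using the GPU:

```bash
# Unload the NVIDIA kernel modules (dependents first), then reload.
# rmmod fails with "module is in use" if anything still holds the GPU.
sudo rmmod nvidia_uvm nvidia_drm nvidia_modeset nvidia
sudo modprobe nvidia
nvidia-smi   # should now initialize NVML without the mismatch error
```

A plain reboot accomplishes the same thing, but on a remote server it would be nice to avoid one.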