LAST EDIT:
Uh, I got it working. The "Operation not permitted" error, together with the fact that Xorg was running as user, got me thinking: I should run Xorg as root. Qubes uses /usr/bin/qubes-run-xorg,
which is a shell script that ends like this:
if qsvc guivm-gui-agent; then
    DISPLAY_XORG=:1
    # Create Xorg. Xephyr will be started using qubes-start-xephyr later.
    exec runuser -u "$DEFAULT_USER" -- /bin/sh -l -c "exec $XORG $DISPLAY_XORG -nolisten tcp vt07 -wr -config xorg-qubes.conf > ~/.xorg-errors 2>&1" &
else
    # Use sh -l here to load all session startup scripts (/etc/profile, ~/.profile
    # etc) to populate environment. This is the environment that will be used for
    # all user applications and qrexec calls.
    exec /usr/bin/qubes-gui-runuser "$DEFAULT_USER" /bin/sh -l -c "exec /usr/bin/xinit $XSESSION -- $XORG :0 -nolisten tcp vt07 -wr -config xorg-qubes.conf > ~/.xsession-errors 2>&1"
fi
Adding DEFAULT_USER="root"
above this if statement launches Xorg as root, and everything just works with the final config at the bottom.
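For anyone wanting to reproduce this, here is a sketch of that one-line patch. The sed pattern is my own; the example operates on a copy under /tmp so it is safe to run anywhere (on a real system you would edit /usr/bin/qubes-run-xorg itself, in the TemplateVM, so the change persists):

```shell
# Work on a copy; create a stand-in script if the real one is absent here.
cp /usr/bin/qubes-run-xorg /tmp/qubes-run-xorg 2>/dev/null || \
    printf '%s\n' '#!/bin/sh' 'if qsvc guivm-gui-agent; then' '    :' 'fi' \
        > /tmp/qubes-run-xorg
# Insert DEFAULT_USER="root" immediately above the if statement.
sed -i '/^if qsvc guivm-gui-agent/i DEFAULT_USER="root"' /tmp/qubes-run-xorg
grep -B1 '^if qsvc' /tmp/qubes-run-xorg
```

Note that this overrides whatever DEFAULT_USER the script was given, so every branch of the if statement now runs Xorg as root.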
ORIGINAL POST:
Hello, I thought I should chime in with more information about the problem, as I've just run into this as well. It is also very nice to know that NVIDIA now allows NVIDIA GPUs to run inside Linux VMs! When I last tried, around September I believe, I didn't get nearly this far.
I will start from the beginning. I am attempting to get CUDA applications running in a VM in a “headless” manner. When a Qubes VM starts without an NVIDIA GPU attached, this is what the Xorg.0.log looks like: Standard Qubes Xorg.0.log (27.8 KB) and we can see Xorg is working as expected:
root 526 0.0 0.1 11504 6236 tty7 S+ 10:38 0:00 /usr/bin/qubes-gui-runuser user /bin/sh -l -c exec /usr/bin/xinit /etc/X11/xinit/xinitrc -- /usr/libexec/Xorg :0 -nolisten tcp vt07 -wr -config xorg-qubes.conf > ~/.xsession-errors 2>&1
user 552 0.0 0.0 4148 1292 ? Ss 10:38 0:00 /usr/bin/xinit /etc/X11/xinit/xinitrc -- /usr/libexec/Xorg :0 -nolisten tcp vt07 -wr -config xorg-qubes.conf
user 627 1.7 2.7 286832 110728 ? Sl 10:38 0:36 /usr/libexec/Xorg :0 -nolisten tcp vt07 -wr -config xorg-qubes.conf
Of note is that Xorg is started as user, not root. The associated xorg-qubes.conf looks like this:
Section "Module"
    Load "fb"
EndSection

Section "ServerLayout"
    Identifier "Default Layout"
    Screen 0 "Screen0" 0 0
    InputDevice "qubesdev"
EndSection

Section "Device"
    Identifier "Videocard0"
    Driver "dummyqbs"
    VideoRam 22501
    Option "GUIDomID" "0"
EndSection

Section "Monitor"
    Identifier "Monitor0"
    HorizSync 49-50
    VertRefresh 34-35
    Modeline "QB2560x1440" 128 2560 2561 2562 2563 1440 1441 1442 1443
EndSection

Section "Screen"
    Identifier "Screen0"
    Device "Videocard0"
    Monitor "Monitor0"
    DefaultDepth 24
    SubSection "Display"
        Viewport 0 0
        Depth 24
        Modes "QB2560x1440"
    EndSubSection
EndSection

Section "InputDevice"
    Identifier "qubesdev"
    Driver "qubes"
EndSection
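As a quick sanity check on that generated Modeline (the numbers are straight from the config above: a 128 MHz pixel clock, htotal 2563, vtotal 1443), the sync rates it implies should land inside the HorizSync/VertRefresh ranges the Monitor section declares:

```shell
# Derive the sync rates implied by:
#   Modeline "QB2560x1440" 128 2560 2561 2562 2563 1440 1441 1442 1443
# hsync = pixel clock / htotal; vrefresh = hsync / vtotal
awk 'BEGIN {
    clock  = 128e6             # pixel clock in Hz (128 MHz)
    htotal = 2563; vtotal = 1443
    hsync  = clock / htotal    # Hz
    vref   = hsync / vtotal    # Hz
    printf "hsync    %.2f kHz (config allows 49-50)\n", hsync / 1000
    printf "vrefresh %.2f Hz  (config allows 34-35)\n", vref
}'
# → hsync    49.94 kHz (config allows 49-50)
# → vrefresh 34.61 Hz  (config allows 34-35)
```

So the template generates a mode that just fits the narrow sync windows it also declares.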
This file is generated from a template by the qubes-gui-agent service when the service starts.
With this setup, as long as the NVIDIA device is not referenced in any of the Xorg configs, the GPU is in a strange state:
bash-5.1# nvidia-smi
Wed Feb 16 12:33:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:08.0 Off |                  N/A |
| 30%   37C    P0    N/A / 220W |      0MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
But torch reports CUDA as available:
(base) [user@gpu-linux ~]$ python
Python 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>>
And the Xorg.0.log:
Xorg.0.log (33.0 KB)
This is different from my previous experience; IIRC, Xorg previously needed to know about the GPU in order for CUDA to work. That's nice. However, I would ideally like to have Coolbits enabled, and as far as I know Coolbits depends on Xorg for whatever reason. So maybe it'll work if I just tell Xorg about the device myself?
(base) [user@gpu-linux ~]$ cat /etc/X11/xorg.conf.d/nvidia.conf
Section "Device"
    # discrete GPU NVIDIA
    Identifier "nvidia"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName "GeForce RTX 3070"
    Option "Coolbits" "28"
    BusID "PCI:8:0:0"
EndSection
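As a side note on the "Coolbits" "28" value: Coolbits is a bitmask, and 28 = 4 + 8 + 16. Per my reading of the NVIDIA driver README (not something established in this thread, so double-check against your driver version): 4 enables manual fan control, 8 enables PowerMizer clock offsets, and 16 enables overvoltage. A quick decode:

```shell
# 28 = 4 + 8 + 16; per the NVIDIA driver README (my reading): 4 = manual fan
# control, 8 = PowerMizer clock offsets, 16 = overvoltage.
cb=28
for bit in 1 2 4 8 16; do
    [ $(( cb & bit )) -ne 0 ] && echo "Coolbits bit $bit is set"
done
# → Coolbits bit 4 is set
# → Coolbits bit 8 is set
# → Coolbits bit 16 is set
```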
Restarting Xorg does pretty much nothing: nvidia-smi
can still report information, and torch says CUDA is available. OK, maybe I need to make a screen out of it?
bash-5.1# cat /etc/X11/xorg.conf.d/nvidia.conf
Section "Screen"
    # virtual monitor
    Identifier "Screen1"
    # discrete GPU nvidia
    Device "nvidia"
    # virtual monitor
    Monitor "Monitor1"
    DefaultDepth 24
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

Section "Monitor"
    Identifier "Monitor1"
    VendorName "Unknown"
    Option "DPMS"
EndSection

Section "Device"
    # discrete GPU NVIDIA
    Identifier "nvidia"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName "GeForce RTX 3070"
    Option "Coolbits" "28"
    BusID "PCI:8:0:0"
EndSection
Of course, nothing. Maybe if I add my own server layout?
bash-5.1# cat /etc/X11/xorg.conf.d/nvidia.conf
Section "ServerLayout"
    Identifier "Default Layout"
    # Option "AllowNVIDIAGPUScreens"
    Screen 0 "Screen0" 0 0
    Screen 1 "Screen1"
    InputDevice "qubesdev"
EndSection

Section "Screen"
    # virtual monitor
    Identifier "Screen1"
    # discrete GPU nvidia
    Device "nvidia"
    # virtual monitor
    Monitor "Monitor1"
    DefaultDepth 24
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection

Section "Monitor"
    Identifier "Monitor1"
    VendorName "Unknown"
    Option "DPMS"
EndSection

Section "Device"
    # discrete GPU NVIDIA
    Identifier "nvidia"
    Driver "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName "GeForce RTX 3070"
    Option "Coolbits" "28"
    BusID "PCI:8:0:0"
EndSection
Restarting Xorg, and it doesn't work. This is where the xf86OpenConsole: VT_ACTIVATE failed: Operation not permitted
error comes in. Perhaps my xorg.conf is very naive; I don't really know. Here's the log:
crashed Xorg.0.log (5.4 KB)
At this point I'm working in a console opened from another VM (qvm-console-in-dispvm gpu-linux
in dom0), and nvidia-smi still seems to recognize the GPU, and PyTorch still reports CUDA as available. Removing the nvidia.conf file and restarting Xorg works, of course. Here is the full dmesg log:
dmesg.log (70.5 KB)
Another note: I've had to restart my computer a few times, because eventually the driver/hardware/something seems to need it. Once it gets into that state, even after removing the nvidia.conf file, Xorg will not work at all and dmesg will contain RmInitAdapter
errors. I have lost the codes for these, but if it happens again I will post them.
EDIT: I found one of the errors. I don't know the context for this one; I was just trying random Xorg configurations:
[ 4655.853492] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0x65:1401)
[ 4655.853925] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[ 4659.880394] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0x65:1401)
[ 4659.880770] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
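Those lines are easy to pull out of a full kernel log with a simple filter. A sketch, demonstrated on the sample lines quoted just above (on a live system you would pipe dmesg in instead of the here-doc):

```shell
# Filter the kernel log for NVIDIA adapter-init failures. The here-doc
# stands in for `dmesg` output so the example is self-contained.
grep -E 'RmInitAdapter|rm_init_adapter' <<'EOF'
[ 4655.853492] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0x65:1401)
[ 4655.853925] NVRM: GPU 0000:00:08.0: rm_init_adapter failed, device minor number 0
[ 4659.880394] NVRM: GPU 0000:00:08.0: RmInitAdapter failed! (0x23:0x65:1401)
EOF
```

In practice: `dmesg | grep -E 'RmInitAdapter|rm_init_adapter'` (dmesg may require root).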
And here's another one that I remember seeing; excuse the ??? placeholders, as I don't remember what those numbers were:
NVRM: GPU 0000:00:08.0: RmInitAdapter failed! ([???]:[???]:1451)
?????? X_ID ????????
NVRM: GPU 0000:00:08.0: RmInitAdapter failed! ([???]:[???]:1451)
I believe 1451 was the code, though maybe it was 1651; I'm not sure. Of course, these numbers are opaque to me.