First and foremost: Yes, I’ve set the modeset=1 option for the kernel module, and sudo cat /sys/module/nvidia_drm/parameters/modeset prints Y.
Furthermore, libnvidia-egl-wayland1 is installed (version 1:1.1.9-1.1, though I also tried 1.1.11 but that didn’t help).
The log from nvidia-bug-report.sh: nvidia-bug-report.log.gz (199.3 KB)
Creating a headless instance of a wlroots-based compositor fails on a T4 GPU (in an Amazon EC2 g4dn.xlarge instance), while it works on my local desktop machine with a GeForce 3060Ti. Both systems use the same distro (Ubuntu 22.04) and the same driver (nvidia-driver-525 525.85.12-0ubuntu1 from Ubuntu’s repo). On the cloud machine with the T4 I also tried 515, nvidia-driver-525-server, and the “NVIDIA gaming driver”, following htt ps://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html.
The nvidia driver installation on the cloud machine generally works: I successfully ran an Unreal Engine 4.27
PixelStreaming server, and I can also run the LXDE desktop in x11vnc (as described in
htt ps://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#configuring-x11vnc-checking-gpu-linux-server).
glxinfo and eglinfo show sensible values, and in X11 I was even able to run Quake II with the
Yamagi Quake II source port (with the OpenGL 1.4, OpenGL 3.2 and OpenGL ES 3.0 renderers).
I’m using libwlroots10 0.15.1-2 from the Ubuntu repo, but I’ve also tried 0.16 from the gamescope PPA
(didn’t make a difference), and while I’m mostly using a patched (for headless support) version of the
cage compositor, this problem can be reproduced with the version of Sway that’s available in the Ubuntu 22.04
repos, by running:
WLR_BACKENDS=headless WLR_LIBINPUT_NO_DEVICES=1 WLR_NO_HARDWARE_CURSORS=1 WLR_BACKENDS=headless sway --unsupported-gpu -d
(-d gives additional debug output).
While debugging the problem on the cloud machine with the T4, I noticed the following:
- (Not critical: eglQueryDmaBufFormatsEXT() returns two formats that don’t have modifiers: GR32, BGR8.
  This also happens on my desktop with the GeForce 3060Ti.
  In the sway console output this causes the
  [wlr] [EGL] command: eglQueryDmaBufModifiersEXT, error: EGL_BAD_PARAMETER (0x300c), message: "EGL_BAD_PARAMETER error: In eglQueryDmaBufModifiersEXT: Invalid format
  lines)
- eglQueryDmaBufModifiersEXT() (called by get_egl_dmabuf_modifiers() in wlroots/render/egl.c)
  returns these modifiers (at least for format XR24, which wlroots uses):
  0: 0x00FFFFFFFFFFFFFF // I think this one is actually added to the list by wlroots
  1: 0x0300000000606010
  2: 0x0300000000606011
  3: 0x0300000000606012
  4: 0x0300000000606013
  5: 0x0300000000606014
  6: 0x0300000000606015
  7: 0x0300000000E08010
  8: 0x0300000000E08011
  9: 0x0300000000E08012
  10: 0x0300000000E08013
  11: 0x0300000000E08014
  12: 0x0300000000E08015
  These are the same modifiers that are returned on the desktop.
- gbm_bo_create_with_modifiers() fails when called with these modifiers and format 0x34325258 aka XR24 aka DRM_FORMAT_XRGB8888 (called by create_buffer() in wlroots/render/allocator/gbm.c; on the desktop this works and uses modifier 0x0300000000E08014, number 11 of the list).
  wlroots doesn’t log this, but instead silently uses a fallback:
  - gbm_bo_create() with flags GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING (and same format: XR24) then succeeds and, according to gbm_bo_get_modifier(), uses modifier 0x0300000000C00014 (which is not in the list returned by eglQueryDmaBufModifiersEXT()!).
    The sway log message is
    [wlr] [render/allocator/gbm.c:140] Allocated 1280x720 GBM buffer (format 0x34325258, modifier 0xFFFFFFFFFFFFFF)
    (the modifier here is DRM_FORMAT_MOD_INVALID, which is used as a fallback and apparently in some contexts means “use whatever”; it’s not the one actually used, which is 0x0300000000C00014, but wlroots doesn’t log that unless you patch it).
  - I also tried different flags, like just GBM_BO_USE_RENDERING or GBM_BO_USE_RENDERING | GBM_BO_USE_LINEAR. That didn’t help (though the modifier returned by gbm_bo_get_modifier() was slightly different, but still not from the list: 0x0300000000C00014).
- Creating an EGLImageKHR with
  eglCreateImageKHR(eglDisplay, EGL_NO_CONTEXT, EGL_LINUX_DMA_BUF_EXT, NULL, attribs)
  succeeds (see htt ps://gitlab.freedesktop.org/wlroots/wlroots/-/blob/0.15/render/egl.c#L656-719 for attribs).
- Creating an FBO with that image however fails:
  glEGLImageTargetRenderbufferStorageOES(GL_RENDERBUFFER, eglImage)
  says (via GL debug output): GL_INVALID_OPERATION error generated. EGLImage not supported
  (see create_buffer() in wlroots/render/gles2/renderer.c). On my desktop this works.
  In the sway log, this causes these log lines:
  [wlr] [GLES2] GL_INVALID_OPERATION error generated. EGLImage not supported
  [wlr] [render/gles2/renderer.c:133] Failed to create FBO
  (the first is really from glEGLImageTargetRenderbufferStorageOES())
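To make the format and modifier values above easier to compare, here is a minimal C sketch that unpacks a DRM fourcc code and splits an NVIDIA block-linear modifier into its fields. The bit positions are an assumption based on my reading of the DRM_FORMAT_MOD_NVIDIA_BLOCK_LINEAR_2D macro in drm_fourcc.h and should be checked against that header; the names fourcc_to_str, nv_modifier_fields, decode_nv_modifier and print_nv_modifier are made up for this example and are not part of wlroots, libdrm or the driver.

```c
#include <stdint.h>
#include <stdio.h>

/* Unpack a DRM fourcc code (four ASCII bytes, lowest byte first) into a
 * NUL-terminated string. out must hold at least 5 bytes. */
void fourcc_to_str(uint32_t fourcc, char out[5])
{
    for (int i = 0; i < 4; i++)
        out[i] = (char)((fourcc >> (8 * i)) & 0xff);
    out[4] = '\0';
}

/* Fields of an NVIDIA block-linear 2D modifier.
 * ASSUMPTION: bit positions follow DRM_FORMAT_MOD_NVIDIA_BLOCK_LINEAR_2D
 * in drm_fourcc.h; verify against your copy of the header. */
struct nv_modifier_fields {
    unsigned vendor;        /* bits 63:56, 0x03 for NVIDIA */
    unsigned compression;   /* bits 25:23 */
    unsigned sector_layout; /* bit  22    */
    unsigned gob_kind_gen;  /* bits 21:20 */
    unsigned page_kind;     /* bits 19:12 */
    unsigned block_linear;  /* bit  4, 1 for block-linear layouts */
    unsigned log2_height;   /* bits 3:0, log2(block height in GOBs) */
};

struct nv_modifier_fields decode_nv_modifier(uint64_t mod)
{
    struct nv_modifier_fields f;
    f.vendor        = (unsigned)((mod >> 56) & 0xff);
    f.compression   = (unsigned)((mod >> 23) & 0x7);
    f.sector_layout = (unsigned)((mod >> 22) & 0x1);
    f.gob_kind_gen  = (unsigned)((mod >> 20) & 0x3);
    f.page_kind     = (unsigned)((mod >> 12) & 0xff);
    f.block_linear  = (unsigned)((mod >> 4) & 0x1);
    f.log2_height   = (unsigned)(mod & 0xf);
    return f;
}

/* Print one modifier with its decoded fields on a single line. */
void print_nv_modifier(uint64_t mod)
{
    struct nv_modifier_fields f = decode_nv_modifier(mod);
    printf("0x%016llx: vendor=0x%02x c=%u s=%u g=%u k=0x%02x bl=%u h=%u\n",
           (unsigned long long)mod, f.vendor, f.compression, f.sector_layout,
           f.gob_kind_gen, f.page_kind, f.block_linear, f.log2_height);
}
```

Calling fourcc_to_str(0x34325258, buf) gives XR24. If the assumed bit layout is right, decoding 0x0300000000E08014 (entry 11 of the list) and 0x0300000000C00014 (the fallback) shows that they differ only in the page kind (0x08 vs. 0x00) and the GOB/kind generation field (2 vs. 0).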
Out of curiosity I also tried using wlroots’ DRM backend instead of the headless one (WLR_BACKENDS=drm), and while this doesn’t work on either machine, I get a lot further on the desktop (I assume it failed on my desktop because X11 was running).
On the server I had to start it in X11 as well because of permission problems ([wlr] [libseat] [common/terminal.c:149] Could not open target tty: Permission denied), but it aborted very early on with:
00:00:00.009 [wlr] [backend/session/session.c:385] Ignoring '/dev/dri/card0': not a KMS device
00:00:00.009 [wlr] [backend/backend.c:217] Found 0 GPUs, cannot create backend
00:00:00.009 [wlr] [backend/backend.c:311] failed to add backend 'drm'
00:00:00.074 [sway/server.c:56] Unable to create backend
So, unlike on my desktop and despite /sys/module/nvidia_drm/parameters/modeset returning Y, drmIsKMS(dev->fd) returns 0 - I guess this could be related to the whole problem?
drmIsKMS(fd) basically calls ioctl(fd, DRM_IOCTL_MODE_GETRESOURCES, &drm_mode_card_res_var); and checks the number of crtcs, connectors and encoders. I tried finding an implementation of this in your open-gpu-kernel-modules source, but failed, so I’m not sure where this is coming from and why it fails (maybe because no display is attached? Update: though on my desktop the headless case still works if I quit X11 and unplug my displays). I have no idea if this is really relevant for my problem, just thought I’d mention it.
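To see which of the three counts is zero on the T4, the check described above can be sketched in C roughly as follows. This is my own approximation and not libdrm’s code: it assumes the Linux UAPI header is available as <drm/drm.h> (with libdrm-dev it may be <libdrm/drm.h> instead), it calls ioctl() directly where libdrm goes through its drmIoctl() wrapper, and the names kms_counts, query_kms_counts and looks_like_kms are hypothetical.

```c
#include <string.h>
#include <sys/ioctl.h>
#include <drm/drm.h> /* ASSUMPTION: Linux UAPI header path; defines
                        DRM_IOCTL_MODE_GETRESOURCES and struct drm_mode_card_res */

/* Hypothetical helper struct for this sketch. */
struct kms_counts {
    unsigned crtcs;
    unsigned connectors;
    unsigned encoders;
};

/* Ask the kernel driver for its mode-setting resource counts.
 * Returns 0 on success, -1 if the ioctl fails (errno is left set). */
int query_kms_counts(int fd, struct kms_counts *out)
{
    struct drm_mode_card_res res;

    /* With all pointers and counts zeroed, the kernel only fills in counts. */
    memset(&res, 0, sizeof(res));
    if (ioctl(fd, DRM_IOCTL_MODE_GETRESOURCES, &res) != 0)
        return -1;

    out->crtcs = res.count_crtcs;
    out->connectors = res.count_connectors;
    out->encoders = res.count_encoders;
    return 0;
}

/* Approximates the check described above: a device counts as KMS-capable
 * only if it reports at least one CRTC, one connector and one encoder. */
int looks_like_kms(int fd)
{
    struct kms_counts c;

    if (query_kms_counts(fd, &c) != 0)
        return 0;
    return c.crtcs > 0 && c.connectors > 0 && c.encoders > 0;
}
```

Opening /dev/dri/card0 with open(path, O_RDWR) and printing the three counts from query_kms_counts() on both machines should show whether the ioctl itself fails on the T4 or whether it succeeds but reports zero connectors or encoders.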
PS: Sorry I had to screw up the links, but on posting I got the message “An error occurred: Sorry, new users can only put one link in a post.”