(Headless) Wayland (with wlroots) doesn't work on T4 GPU

First and foremost: Yes, I’ve set the modeset=1 option for the kernel module, and sudo cat /sys/module/nvidia_drm/parameters/modeset prints Y.
Furthermore, libnvidia-egl-wayland1 is installed (version 1:1.1.9-1.1, though I also tried 1.1.11 but that didn’t help).

The log from nvidia-bug-report.sh: nvidia-bug-report.log.gz (199.3 KB)

Creating a headless instance of a wlroots-based compositor fails on a T4 GPU (in an Amazon EC2 g4dn.xlarge instance), while it works on my local desktop machine with a Geforce 3060Ti. Both systems use the same distro (Ubuntu 22.04) and the same driver (nvidia-driver-525 525.85.12-0ubuntu1 from Ubuntu’s repo, though on the cloud machine with the T4 I also tried 515 and nvidia-driver-525-server and installing the “NVIDIA gaming driver”, following htt ps://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html).

The nvidia driver installation on the cloud machine generally works, I successfully ran an Unreal Engine 4.27
PixelStreaming server, and I can also run the LXDE desktop in x11vnc (as described in
htt ps://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#configuring-x11vnc-checking-gpu-linux-server).
glxinfo and eglinfo show sensible values and in X11 I was even able to run Quake II with the
Yamagi Quake II source port (with OpenGL1.4, OpenGL 3.2 and OpenGL ES 3.0 renderers).

I’m using libwlroots10 0.15.1-2 from the Ubuntu repo, but I’ve also tried 0.16 from the gamescope PPA
(didn’t make a difference), and while I’m mostly using a patched (for headless support) version of the
cage compositor, this problem can be reproduced with the version of Sway that’s available in the Ubuntu 22.04
repos, by running:
WLR_BACKENDS=headless WLR_LIBINPUT_NO_DEVICES=1 WLR_NO_HARDWARE_CURSORS=1 sway --unsupported-gpu -d
(-d gives additional debug output).

While debugging the problem on the cloud machine with the T4, I noticed the following:

  • (Not critical: eglQueryDmaBufFormatsEXT() returns two formats that don’t have modifiers: GR32, BGR8.
    This also happens on my desktop with the Geforce 3060Ti.
    In the sway console output this causes the [wlr] [EGL] command: eglQueryDmaBufModifiersEXT, error: EGL_BAD_PARAMETER (0x300c), message: "EGL_BAD_PARAMETER error: In eglQueryDmaBufModifiersEXT: Invalid format lines)
  • eglQueryDmaBufModifiersEXT() (called by get_egl_dmabuf_modifiers() in wlroots/render/egl.c)
    returns these modifiers (at least for format XR24, which wlroots uses):
      0: 0x00FFFFFFFFFFFFFF // I think this one is actually added to the list by wlroots
      1: 0x0300000000606010
      2: 0x0300000000606011
      3: 0x0300000000606012
      4: 0x0300000000606013
      5: 0x0300000000606014
      6: 0x0300000000606015
      7: 0x0300000000E08010
      8: 0x0300000000E08011
      9: 0x0300000000E08012
     10: 0x0300000000E08013
     11: 0x0300000000E08014
     12: 0x0300000000E08015
    
    These are the same modifiers that are returned on the desktop.
  • gbm_bo_create_with_modifiers() fails when called with these modifiers and format 0x34325258 aka XR24 aka DRM_FORMAT_XRGB8888
    (called by create_buffer() in wlroots/render/allocator/gbm.c ; on the desktop this works and uses modifier 0x0300000000E08014, number 11 of the list.)
    wlroots doesn’t log this, but instead silently uses a fallback:
  • gbm_bo_create() with flags GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING (and same format: XR24) then succeeds and, according to gbm_bo_get_modifier(), uses modifier 0x0300000000C00014 (which is not in the list returned by eglQueryDmaBufModifiersEXT()!)
    The sway log message is [wlr] [render/allocator/gbm.c:140] Allocated 1280x720 GBM buffer (format 0x34325258, modifier 0xFFFFFFFFFFFFFF). The modifier logged here is DRM_FORMAT_MOD_INVALID, which is used as a fallback and apparently means “use whatever” in some contexts. It is not the modifier actually used: that is 0x0300000000C00014, which wlroots doesn’t log unless you patch it.
    • I also tried different flags, like just GBM_BO_USE_RENDERING or GBM_BO_USE_RENDERING | GBM_BO_USE_LINEAR. That didn’t help (though the modifier returned by gbm_bo_get_modifier() was slightly different, but still not from the list: 0x0300000000C00014).
  • Creating an EGLImageKHR with eglCreateImageKHR(eglDisplay, EGL_NO_CONTEXT, EGL_LINUX_DMA_BUF_EXT, NULL, attribs) succeeds
    (see htt ps://gitlab.freedesktop.org/wlroots/wlroots/-/blob/0.15/render/egl.c#L656-719 for attribs)
  • Creating an FBO with that image however fails: glEGLImageTargetRenderbufferStorageOES(GL_RENDERBUFFER, eglImage)
    says (via GL debug output): GL_INVALID_OPERATION error generated. EGLImage not supported
    (see create_buffer() in wlroots/render/gles2/renderer.c). On my desktop this works.
    In the sway log, this causes these log lines:
    [wlr] [GLES2] GL_INVALID_OPERATION error generated. EGLImage not supported
    [wlr] [render/gles2/renderer.c:133] Failed to create FBO
    (the first is really from glEGLImageTargetRenderbufferStorageOES())
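
To compare the modifiers listed above, here is a minimal sketch that splits an NVIDIA modifier into its fields. The bit layout is an assumption based on the DRM_FORMAT_MOD_NVIDIA_BLOCK_LINEAR_2D macro in drm_fourcc.h. The names nv_mod_fields, nv_mod_decode, nv_mod_print and nv_mod_demo are hypothetical and not part of wlroots, libgbm or the driver.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of an NVIDIA block-linear modifier. The bit positions are an
 * ASSUMPTION based on DRM_FORMAT_MOD_NVIDIA_BLOCK_LINEAR_2D in drm_fourcc.h. */
struct nv_mod_fields {
    unsigned vendor;        /* bits 63:56, 0x03 for NVIDIA */
    unsigned log2_height;   /* bits 3:0, log2 of block height in GOBs */
    unsigned block_linear;  /* bit 4, set for block-linear layouts */
    unsigned page_kind;     /* bits 19:12 */
    unsigned gob_gen;       /* bits 21:20, GOB height / page kind generation */
    unsigned sector_layout; /* bit 22 */
    unsigned compression;   /* bits 25:23 */
};

/* Hypothetical helper: extract the fields from a 64-bit modifier. */
struct nv_mod_fields nv_mod_decode(uint64_t mod)
{
    struct nv_mod_fields f;
    f.vendor        = (unsigned)((mod >> 56) & 0xff);
    f.log2_height   = (unsigned)(mod & 0xf);
    f.block_linear  = (unsigned)((mod >> 4) & 0x1);
    f.page_kind     = (unsigned)((mod >> 12) & 0xff);
    f.gob_gen       = (unsigned)((mod >> 20) & 0x3);
    f.sector_layout = (unsigned)((mod >> 22) & 0x1);
    f.compression   = (unsigned)((mod >> 23) & 0x7);
    return f;
}

/* Hypothetical helper: print one modifier with its decoded fields. */
void nv_mod_print(uint64_t mod)
{
    struct nv_mod_fields f = nv_mod_decode(mod);
    printf("0x%016" PRIX64 ": vendor=0x%02x h=%u kind=0x%02x gen=%u sector=%u comp=%u\n",
           mod, f.vendor, f.log2_height, f.page_kind, f.gob_gen,
           f.sector_layout, f.compression);
}

/* Prints the two modifiers from the report and checks that, under the
 * assumed layout, they share the block height but not the page kind. */
void nv_mod_demo(void)
{
    const uint64_t desktop_mod  = 0x0300000000E08014ULL; /* works on the desktop */
    const uint64_t fallback_mod = 0x0300000000C00014ULL; /* gbm_bo_create() on the T4 */
    struct nv_mod_fields ok  = nv_mod_decode(desktop_mod);
    struct nv_mod_fields bad = nv_mod_decode(fallback_mod);

    nv_mod_print(desktop_mod);
    nv_mod_print(fallback_mod);
    assert(ok.vendor == 0x03 && bad.vendor == 0x03);
    assert(ok.log2_height == bad.log2_height);
    assert(ok.page_kind != bad.page_kind);
}
```

If the assumed layout is right, 0x0300000000E08014 and 0x0300000000C00014 agree in block height, sector layout and compression, and differ only in the page kind (0x08 vs 0x00) and generation (2 vs 0) fields. What that difference means for the driver is not something this sketch can tell.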

Out of curiosity I also tried using wlroots’ DRM backend instead of the headless one (WLR_BACKENDS=drm), and while this doesn’t work on either machine, I get a lot further on the desktop (I assume it failed on my desktop because X11 was running).
On the server I had to start it in X11 as well because of permission problems ([wlr] [libseat] [common/terminal.c:149] Could not open target tty: Permission denied), but it aborted very early on with:

00:00:00.009 [wlr] [backend/session/session.c:385] Ignoring '/dev/dri/card0': not a KMS device
00:00:00.009 [wlr] [backend/backend.c:217] Found 0 GPUs, cannot create backend
00:00:00.009 [wlr] [backend/backend.c:311] failed to add backend 'drm'
00:00:00.074 [sway/server.c:56] Unable to create backend

So, unlike on my desktop and despite /sys/module/nvidia_drm/parameters/modeset returning Y, drmIsKMS(dev->fd) returns 0 - I guess this could be related to the whole problem?
drmIsKMS(fd) basically calls ioctl(fd, DRM_IOCTL_MODE_GETRESOURCES, &drm_mode_card_res_var); and checks the number of crtcs, connectors and encoders. I tried finding an implementation of this in your open-gpu-kernel-modules source, but failed, so I’m not sure where this is coming from and why it fails (maybe because no display is attached? Update: though on my desktop the headless case still works if I quit X11 and unplug my displays) - and no idea if this is really relevant for my problem, just thought I’d mention it.
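
For readers unfamiliar with that check, here is a minimal sketch of the decision described above, with the ioctl left out so it runs anywhere. The struct kms_counts and the function looks_like_kms are hypothetical stand-ins. In libdrm the counts would come from the struct filled in by DRM_IOCTL_MODE_GETRESOURCES.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the count fields that
 * DRM_IOCTL_MODE_GETRESOURCES fills in. */
struct kms_counts {
    unsigned crtcs;
    unsigned connectors;
    unsigned encoders;
};

/* Mirrors the check described in the post: a device only counts as a KMS
 * device if it reports at least one CRTC, one connector and one encoder. */
int looks_like_kms(const struct kms_counts *c)
{
    assert(c != NULL);
    return c->crtcs > 0 && c->connectors > 0 && c->encoders > 0;
}
```

Under this logic, a GPU that exposes no connectors at all would be rejected with "not a KMS device" even with modeset=1. Whether that is what happens on the T4 is an assumption and not confirmed by the logs above.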

PS: Sorry I had to screw up the links, but on posting I got the message “An error occurred: Sorry, new users can only put one link in a post.”

@DanielGibson thank you for reaching out. Can you in a couple simple sentences describe what you are trying to achieve and what is the problem you encountered?

I’m trying to run an application in a wlroots-based wayland compositor (a patched cage, but can be reproduced with an unpatched sway) headlessly (this means: without outputting to a display), so I can grab its output and stream it into a texture of an unreal-engine 4.27 application (that runs unreal engine “PixelStreaming” on the cloud server, so its output can be viewed in a webbrowser that connects to the server).
This works on my desktop (with a geforce 3060Ti), but not on the amazon cloud server with a T4 GPU, even when using the exact same driver, because the wayland compositor fails: specifically, its call to glEGLImageTargetRenderbufferStorageOES() fails on the cloud server with the T4 GPU, while the same call succeeds on my local geforce 3060ti.

(Another usecase for running a wayland compositor headlessly would be using wayvnc or similar, to run a desktop in the cloud, similar to the x11vnc usecase documented in Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation)

@DanielGibson thank you for the clarifications. I will have a vGPU specialist review this and update you shortly.

Thank you very much!

@DanielGibson Since this is running on an AWS instance, please engage with Amazon for your setup’s requirements. Please also consider the following for this case:

  • Customers must use the vGPU drivers provided by AMAZON.

I’m not using vGPUs, as I wrote I’m using an Amazon g4dn.xlarge instance, which gives me a dedicated T4 GPU, passed through into “my” VM/instance.

As I wrote, I tried drivers provided by Amazon, as well as the ones from the Ubuntu repo - and they all work in general: I can run the Unreal Engine (with PixelStreaming; it uses Vulkan), I can run X11 (via x11vnc) and in X11 I can run OpenGL games (like Yamagi Quake II; yes, I’ve verified that they’re indeed using the nvidia driver and not llvmpipe or some other software implementation).

What doesn’t work is running Wayland headlessly, even though the same drivers support that on other GPUs.
I’m pretty sure this is a driver bug, as the information it gives me is inconsistent: the “modifiers” returned by eglQueryDmaBufModifiersEXT() don’t work with gbm_bo_create_with_modifiers(), gbm_bo_create() then gives me a different modifier, and the EGLImageKHR I get from eglCreateImageKHR() still doesn’t work with glEGLImageTargetRenderbufferStorageOES().
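
The inconsistency can be checked mechanically. The following is a minimal sketch, not wlroots code: the array holds the XR24 modifiers quoted in the first post, and the names advertised_xr24 and modifier_is_advertised are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XR24 modifiers as listed in the first post. Index 0 is
 * DRM_FORMAT_MOD_INVALID, which wlroots appears to add itself. */
const uint64_t advertised_xr24[] = {
    0x00FFFFFFFFFFFFFFULL,
    0x0300000000606010ULL, 0x0300000000606011ULL, 0x0300000000606012ULL,
    0x0300000000606013ULL, 0x0300000000606014ULL, 0x0300000000606015ULL,
    0x0300000000E08010ULL, 0x0300000000E08011ULL, 0x0300000000E08012ULL,
    0x0300000000E08013ULL, 0x0300000000E08014ULL, 0x0300000000E08015ULL,
};
const size_t advertised_xr24_count =
    sizeof(advertised_xr24) / sizeof(advertised_xr24[0]);

/* Hypothetical helper: returns 1 if mod is in the list, 0 otherwise. */
int modifier_is_advertised(const uint64_t *list, size_t count, uint64_t mod)
{
    assert(list != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (list[i] == mod)
            return 1;
    }
    return 0;
}
```

With this list, 0x0300000000E08014 (used on the desktop) is found, while 0x0300000000C00014 (reported by gbm_bo_get_modifier() on the T4) is not.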

I don’t think it makes much sense to contact Amazon about this, as I’m pretty sure that the drivers that they provide are developed by nvidia and not Amazon.

  • AMAZON does not support Wayland as mentioned here

I’m not using NICE DCV.

Most remote protocols like NICE DCV, VMware Horizon View do not support Wayland.

I don’t care, I don’t plan on using any of them (if I wanted to use a wayland-based “cloud desktop” - which I don’t, it’s just another usecase that’s affected by the bug - I’d probably use VNC, via wayvnc, or RDP, which at least GNOME’s wayland compositor supports).

Wayland is not supported with vGPU drivers also

I’m not using a vGPU.


Please just forward this bugreport to a Linux driver developer, I’m sure they will understand what my report is about.

Also note that Amazon officially supports using the “Public NVIDIA drivers”, see Install NVIDIA drivers on Linux instances - Amazon Elastic Compute Cloud (“Option 2: Public NVIDIA drivers”).

@DanielGibson Did you find a solution? I know this post is a year old.
I’m trying to remote into a headless g4dn with a Tesla GPU using NoMachine NX and run a Wayland/Weston session with no luck.
I’ve tried all the various latest NVIDIA drivers with the same results.
The error when I try to start a Weston session is “no drm device found”. The NVIDIA GPU is used when I run Xorg and all the relevant kernel modules are loaded, modeset=1.

For me the issue was fixed in the 535 drivers.
I never tried Weston, only wlroots-based compositors, and I had to explicitly tell wlroots to use headless mode (by setting the WLR_BACKENDS environment variable to headless - but this is wlroots-specific; if other compositors support headless mode, enabling it will work differently).

@DanielGibson Thanks for the info. Will try Weston headless mode.