eglCreateImage and eglCreateImageKHR fail with EGL_BAD_DISPLAY

This is specific to EGL+GLES and does not affect GLX. It is also specific to Linux (the only platform tested). eglCreateImage is used as part of compositing (texture-from-pixmap) in X11. This started in recent releases within the past few months (it worked 6-12+ months ago). I’ve tried both eglCreateImage and eglCreateImageKHR (obtained via eglGetProcAddress). EGL supports pixmaps: EGL_KHR_image_pixmap is present in the extension strings. This happens with 100% reproducibility. It ONLY affects NVIDIA drivers. Mesa works fine (Intel, Radeon, Nouveau, VC4 drivers on ARM), and the Mali closed drivers on ARM work fine too.

eglCreateImage fails every time it is called, even with a valid and correct EGLDisplay handle (the exact same handle is used for eglSwapBuffers, eglQuerySurface, etc. and works fine both before and after eglCreateImage). The display is not terminated in between. I have tried both EGL_NO_CONTEXT and the actual context handle. EGL_NATIVE_PIXMAP_KHR is used and the attribute list is NULL; I’ve also tried providing attributes. I’m unable to figure out EXACTLY why it thinks the display is bad, since every other EGL API call thinks the display is good.

Relevant bit of code:

surface = eglCreateImage(egl_disp, EGL_NO_CONTEXT, EGL_NATIVE_PIXMAP_KHR, (void *)pixmap, NULL);

The returned image is NULL. eglGetError() reports 0x3008 (EGL_BAD_DISPLAY) after this create-image call. As I mentioned, I’ve printed the egl_disp value and it is constant across the swap-buffer, query-surface, etc. calls, which all work there without error. The context is actually bound to the thread at the time too. Everything else renders and swaps and so on correctly.
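For logging failures like the 0x3008 above, a small error-code-to-name mapper is handy. This is a self-contained sketch: the constants are copied from the values the EGL specification assigns in EGL/egl.h, and the helper name is illustrative, not part of any existing codebase.

```c
#include <string.h>

/* EGL error codes as assigned by the EGL specification (see EGL/egl.h),
 * redefined here so this sketch compiles standalone. */
#define EGL_SUCCESS             0x3000
#define EGL_NOT_INITIALIZED     0x3001
#define EGL_BAD_ACCESS          0x3002
#define EGL_BAD_ALLOC           0x3003
#define EGL_BAD_ATTRIBUTE       0x3004
#define EGL_BAD_CONFIG          0x3005
#define EGL_BAD_CONTEXT         0x3006
#define EGL_BAD_CURRENT_SURFACE 0x3007
#define EGL_BAD_DISPLAY         0x3008
#define EGL_BAD_SURFACE         0x3009
#define EGL_BAD_MATCH           0x300A
#define EGL_BAD_PARAMETER       0x300B
#define EGL_BAD_NATIVE_PIXMAP   0x300C
#define EGL_BAD_NATIVE_WINDOW   0x300D
#define EGL_CONTEXT_LOST        0x300E

/* Map an eglGetError() value to its symbolic name for log output. */
static const char *
egl_error_name(int err)
{
   switch (err)
     {
      case EGL_SUCCESS:             return "EGL_SUCCESS";
      case EGL_NOT_INITIALIZED:     return "EGL_NOT_INITIALIZED";
      case EGL_BAD_ACCESS:          return "EGL_BAD_ACCESS";
      case EGL_BAD_ALLOC:           return "EGL_BAD_ALLOC";
      case EGL_BAD_ATTRIBUTE:       return "EGL_BAD_ATTRIBUTE";
      case EGL_BAD_CONFIG:          return "EGL_BAD_CONFIG";
      case EGL_BAD_CONTEXT:         return "EGL_BAD_CONTEXT";
      case EGL_BAD_CURRENT_SURFACE: return "EGL_BAD_CURRENT_SURFACE";
      case EGL_BAD_DISPLAY:         return "EGL_BAD_DISPLAY";
      case EGL_BAD_SURFACE:         return "EGL_BAD_SURFACE";
      case EGL_BAD_MATCH:           return "EGL_BAD_MATCH";
      case EGL_BAD_PARAMETER:       return "EGL_BAD_PARAMETER";
      case EGL_BAD_NATIVE_PIXMAP:   return "EGL_BAD_NATIVE_PIXMAP";
      case EGL_BAD_NATIVE_WINDOW:   return "EGL_BAD_NATIVE_WINDOW";
      case EGL_CONTEXT_LOST:        return "EGL_CONTEXT_LOST";
      default:                      return "UNKNOWN";
     }
}
```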

I cannot find a way of getting more debug output on why it thinks the display is bad, since the driver is closed source, so I’m asking here for help or information on how to debug this. I suspect a change inside the drivers altered it from working to not working. Maybe it has nothing to do with the display handle at all? The pixmap is valid…

How can I get more debug info out of libEGL for this?

FYI, the relevant libraries and window manager/compositor are:

http://www.enlightenment.org

Relevant download (git master: https://www.enlightenment.org/download); scroll down about halfway.

nvidia-bug-report.log.gz (239 KB)

Maybe this will help; libglvnd is the most likely change in the last 6-12+ months to have caused it.

https://github.com/libretro/RetroArch/issues/4790#issuecomment-291197283

While interesting… I’m not getting EGL_NO_DISPLAY (which is 0/NULL); the EGL display is a real pointer value. :) Just for giggles I tried:

export EGL_PLATFORM=0x31D5
export EGL_PLATFORM=0x31D6

Both of the above. Still no go (those are the X11 platform IDs, taken from https://github.com/NVIDIA/libglvnd/blob/master/src/generate/xml/egl.xml).

eglGetDisplay is all fine and working… and as I mentioned - all other rendering is ok. :(

Someone running into the same issue internally pointed me at this thread. Sorry I didn’t notice it before. Are you calling eglCreateImage() or eglCreateImageKHR()? We don’t expose EGL 1.5 on desktop drivers yet, so only the latter is available. With GLVND, you will get an EGL_BAD_DISPLAY error if you call a function on a given display that isn’t supported by the underlying driver. Switching to eglCreateImageKHR() fixed the issue for the internal user.

Yes, we actually use both eglCreateImage() and eglCreateImageKHR() now. We query the EGL version string and use eglCreateImage() if it’s EGL 1.5 or higher; otherwise we use eglCreateImageKHR(). We had the bug above because we preferred eglCreateImage() if it existed, assuming that a driver unable to support it wouldn’t even expose it… but GLVND changed things up a bit. :) So we fixed that, and it got rid of our “black windows on NVIDIA EGL/GLES” bug, but we still have a performance issue.
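The dispatch described above can be sketched as pure string logic over the results of eglQueryString(). This is a minimal sketch (the helper name and exact matching are illustrative, not Enlightenment’s actual code); the key point is that the choice must come from the queried version/extension strings, because under GLVND eglGetProcAddress() can return a non-NULL pointer for a function the underlying driver doesn’t support, and calling it then fails with EGL_BAD_DISPLAY.

```c
#include <stdio.h>
#include <string.h>

/* Decide which create-image entry point to use, given the strings from
 * eglQueryString(dpy, EGL_VERSION) and eglQueryString(dpy, EGL_EXTENSIONS).
 * Returns 1 for core eglCreateImage() (EGL >= 1.5), 2 for
 * eglCreateImageKHR() (EGL_KHR_image_base / EGL_KHR_image_pixmap),
 * 0 if neither is usable. */
static int
choose_create_image(const char *version, const char *extensions)
{
   int major = 0, minor = 0;

   if (version && (sscanf(version, "%d.%d", &major, &minor) == 2) &&
       ((major > 1) || ((major == 1) && (minor >= 5))))
     return 1;
   if (extensions &&
       (strstr(extensions, "EGL_KHR_image_base") ||
        strstr(extensions, "EGL_KHR_image_pixmap")))
     return 2;
   return 0;
}
```

With NVIDIA’s desktop driver reporting EGL 1.4 plus EGL_KHR_image_pixmap, this picks eglCreateImageKHR(), matching the fix described above.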

The framerate goes down based on the number of pixmaps we have to bind to render: the more windows we have to render, the slower things get, by a LOT. It doesn’t affect the GLX drivers. Our GL and GLES code is almost the same, sharing almost all of the gl*() calls; the differences are almost entirely GLX vs. EGL. It used to be fast on NVIDIA with EGL/GLES, but no longer for the past year or so. :( A different problem anyway.

Are you re-binding the pixmaps every time there’s damage? The GLX extension has a fast path for that in our code. The EGL extension doesn’t, since IIRC the EGL spec doesn’t have the same lack of clarity that implies a re-bind is needed to get up-to-date contents, so I assumed apps wouldn’t be binding as frequently.

sorry about the delay - have a lot going on right now. :)

With GLX it’s glXBindTexImage(), and that seems fast; we don’t even bother to unbind. For EGL+GLES it’s glEGLImageTargetTexture2DOES(). But to work on other drivers like Mali and other mobile drivers we have to re-create the EGLImage every frame, and that is super-slow on NVIDIA EGL/GLES. … I suspect a server round-trip, where Mesa probably just does a no-op or handles it internally.
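The per-frame pattern described above looks roughly like the following. The EGL entry points are mocked here with counters so the sketch is self-contained and runnable; real code would call eglCreateImageKHR(), glEGLImageTargetTexture2DOES(), and eglDestroyImageKHR(). The counters stand in for driver work: on NVIDIA, the create step is the expensive part, so paying it once per window per frame is what makes this loop slow there.

```c
#include <stddef.h>

static int created, bound, destroyed;

/* Mock stand-ins for eglCreateImageKHR(), glEGLImageTargetTexture2DOES()
 * and eglDestroyImageKHR(); they only count calls for illustration. */
static void *mock_create_image(void *pixmap)  { created++; return pixmap; }
static void  mock_bind_image(void *img)       { (void)img; bound++; }
static void  mock_destroy_image(void *img)    { (void)img; destroyed++; }

/* The per-frame pattern needed for Mali-style drivers: re-create the
 * EGLImage on every texture-from-pixmap draw so the image tracks the
 * pixmap's current backing buffer, then destroy it again. */
static void
draw_pixmap_frame(void *pixmap)
{
   void *img = mock_create_image(pixmap);   /* eglCreateImageKHR(...) */
   mock_bind_image(img);                    /* glEGLImageTargetTexture2DOES(...) */
   /* ... draw the textured quad for this window ... */
   mock_destroy_image(img);                 /* eglDestroyImageKHR(...) */
}
```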

Thanks for the feedback, that’s interesting. I don’t see any round-trips there from a quick inspection, but round trips aren’t actually that expensive (glXBindTexImage() does a round trip in our driver) by themselves.

To be clear: You’re not creating a new image from the pixmap every frame, just rebinding the same image to the same texture over and over? I think I see room to optimize that, but it also seems like a bug in the other drivers that this has any effect. Do you add mipmaps or something to the texture after binding the image? That would detach the image and hence require a rebind like this, and would also invalidate the optimization opportunities I have in mind. However, I assume that would also mean you’re using TEXTURE_2D rather than TEXTURE_EXTERNAL_OES, which technically invalidates any of the content in the image at bind time anyway, though I think it would be preserved on our driver’s implementation.

Actually, we eglCreateImage + eglDestroyImage every time we render that pixmap/texture. As above, this is necessary for drivers like Mali: if you don’t, you end up with stale old content, because their interpretation is that you must create a new EGLImage to map to the buffer that was swapped; the image remains pointing to the old buffer until destroyed and will never be updated again. The create + destroy is fast on Mesa drivers and Mali closed drivers etc. …but not on NVIDIA. It’s necessary for other drivers, but not for NVIDIA’s. No mipmaps; any supersampling is done in the shaders on the fly.

OK, that’s unfortunate. In that case, there’s no trivial fix on our side. I’ll file a bug to track down why it got slower, but I’d suggest making the current code a WAR for broken drivers on your side, and using the fast path on NVIDIA drivers and any others. Drivers don’t get to decide to use buffer flips for pixmap updates without proper tracking in a way that violates the EGLImage spec just because they want to. If only there were a CTS test for this…

Any idea roughly which NVIDIA drivers started showing the slowdown?

Yes, it is unfortunate. :( I’ve heard from two embedded GPU vendors that the correct behavior, from GL’s point of view, is to destroy and re-create every frame because the EGLImage has to remain unchanged unless you changed it yourself. I think you guys look at it from the X11 view: it’s a pixmap, the pixmap can be drawn to at any time, and the EGLImage just maps to the pixmap and thus can implicitly have its content change, or something along those lines. It’s a deep divide as to how foreign out-of-process buffer sources should be treated.

So we now have a special NVIDIA-only path that avoids the destroy + create each time, and of course it’s all fast again.
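A driver-specific path like the one just described could be selected from the driver’s vendor string, e.g. the result of eglQueryString(dpy, EGL_VENDOR). This is a hypothetical sketch (the enum, helper name, and matching are illustrative, not Enlightenment’s code): create-once-and-rebind on NVIDIA, destroy + create per frame everywhere else.

```c
#include <string.h>

/* Texture-from-pixmap strategy, picked per driver. */
typedef enum
{
   BIND_ONCE,          /* create the EGLImage once, rebind on damage  */
   RECREATE_PER_FRAME  /* destroy + create every frame (Mali-style)   */
} Tfp_Strategy;

/* Choose a strategy from the vendor string (hypothetical helper).
 * Default to the slow-but-works-everywhere path when unsure. */
static Tfp_Strategy
choose_tfp_strategy(const char *vendor)
{
   if (vendor && strstr(vendor, "NVIDIA")) return BIND_ONCE;
   return RECREATE_PER_FRAME;
}
```

Defaulting to RECREATE_PER_FRAME keeps unknown drivers correct at the cost of speed, which matches the “works everywhere, but slow sometimes in some places” trade-off discussed below in the thread.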

Now, as for when? Ummmm… last year… I really don’t remember the exact version or even the time now… :( Sorry about that. :(

I’ve gone through a selection of drivers from the past 4 years and don’t see any measurable change in speed for Create+Bind+Trivial Draw+Destroy in a targeted test.

The create step is similar in cost to glXCreatePixmap() for us. This is when we map the image into the local process’s GPU address space, which is an expensive operation. For GLX the container is the GLXPixmap; for EGL you create an EGLImage directly from an X pixmap, so we have no other client-side container in which to cache this mapping. If you recreate it, we incur all that cost every time. Unfortunately, I don’t have a better recommendation than sticking with a different path for our driver.

As far as the spec goes, I think the other vendors’ interpretations are invalid, not that it matters if they’re shipping implementations based on their interpretation. I don’t see anything in the EGL_KHR_image_base spec that states an image’s content can become undefined after binding unless the image is respecified by a client API (meaning its memory requirements are changed somehow, like adding mipmaps, or some GL operation that results in a new texture object with the same name being created), and nothing in the EGL_KHR_image_pixmap spec overrides this. The closest thing there is that if multiple APIs are accessing the image simultaneously and one of them is rendering to it, the results are undefined everywhere. However, that means actual simultaneous access (read/write races), not write-then-read or read-then-write. Write-then-read is specifically defined as producing visible results in every EGLImage bound to the source, so I maintain this is a bug in the other implementations unless they’re using some very loose interpretation of what constitutes a read/write race.


Here’s my recollection of this issue. Things worked fine, then some update changed symbols and broke our eglGetProcAddress() lookup of eglCreateImageKHR or something like that (we looked for eglCreateImage first, then the KHR variant, but this messed things up because we didn’t check the extension strings too, and I think libglvnd changed the behavior). Anyway, this broke our EGL/GLES. I fixed that up, and some time later I noticed this performance issue, since by then I was building EGL/GLES by default as I’d fixed things to work again… I know, not scientific… :(

Anyway, as for the spec, I agree with you, but I remember other vendors’ devs telling me the opposing interpretation. I told them they were wrong. Either way, it ended up a reality to deal with. It’s probably a slightly darker corner of the EGL world that isn’t trodden regularly, so it’s not surprising, but perhaps it’s something for NVIDIA’s drivers to handle so we don’t have to work around things if we take the “works everywhere, but slow sometimes in some places” create+destroy path. I’m sure a trivial perf-test entry in your test suite would demonstrate this and allow an improvement, but it still brings up a bigger-picture problem. :(