EGL_EXT_image_dma_buf_import broken - EGL_BAD_ALLOC with tons of free RAM

Recently I upgraded my Linux NVidia binary driver to 455.45.01, which has now started advertising EGL_EXT_image_dma_buf_import support. Unfortunately it seems to be broken: the following code, which works fine on Intel and AMD GPUs, does not work here:

EGLAttrib const attribs[] =
{
  EGL_WIDTH                         , 1920,
  EGL_HEIGHT                        , 1200,
  EGL_LINUX_DRM_FOURCC_EXT          , DRM_FORMAT_ARGB8888,
  EGL_DMA_BUF_PLANE0_FD_EXT         , dmaFd,
  EGL_DMA_BUF_PLANE0_OFFSET_EXT     , 0,
  EGL_DMA_BUF_PLANE0_PITCH_EXT      , 7680, // 1920 px * 4 bytes per pixel
  EGL_NONE
};

EGLImage image = eglCreateImage(                                                
  texture->display,                                                             
  EGL_NO_CONTEXT,                                                               
  EGL_LINUX_DMA_BUF_EXT,                                                        
  (EGLClientBuffer)NULL,                                                        
  attribs                                                                       
);      

The fourcc in use is one that eglQueryDmaBufFormatsEXT reports my GPU (Quadro K1200) supports. The call fails with EGL_BAD_PARAMETER.

This very much looks like a driver bug: I have dug through the Khronos documentation over and over again and can't find a reason why this would fail only on NVidia hardware.

Additional information that may be of use:

EGL       : 1.5
Vendor    : NVIDIA Corporation
Renderer  : Quadro K1200/PCIe/SSE2
Version   : OpenGL ES 3.2 NVIDIA 455.45.01
EGL APIs  : OpenGL_ES OpenGL
Extensions: EGL_EXT_buffer_age EGL_EXT_client_sync EGL_EXT_create_context_robustness EGL_EXT_image_dma_buf_import EGL_EXT_image_dma_buf_import_modifiers EGL_MESA_image_dma_buf_export EGL_EXT_output_base EGL_EXT_stream_acquire_mode EGL_EXT_sync_reuse EGL_IMG_context_priority EGL_KHR_config_attribs EGL_KHR_create_context_no_error EGL_KHR_context_flush_control EGL_KHR_create_context EGL_KHR_fence_sync EGL_KHR_get_all_proc_addresses EGL_KHR_partial_update EGL_KHR_swap_buffers_with_damage EGL_KHR_no_config_context EGL_KHR_gl_colorspace EGL_KHR_gl_renderbuffer_image EGL_KHR_gl_texture_2D_image EGL_KHR_gl_texture_3D_image EGL_KHR_gl_texture_cubemap_image EGL_KHR_image EGL_KHR_image_base EGL_KHR_image_pixmap EGL_KHR_reusable_sync EGL_KHR_stream EGL_KHR_stream_attrib EGL_KHR_stream_consumer_gltexture EGL_KHR_stream_cross_process_fd EGL_KHR_stream_fifo EGL_KHR_stream_producer_eglsurface EGL_KHR_surfaceless_context EGL_KHR_wait_sync EGL_NV_nvrm_fence_sync EGL_NV_post_sub_buffer EGL_NV_quadruple_buffer EGL_NV_stream_consumer_eglimage EGL_NV_stream_cross_display EGL_NV_stream_cross_object EGL_NV_stream_cross_process EGL_NV_stream_cross_system EGL_NV_stream_dma EGL_NV_stream_flush EGL_NV_stream_metadata EGL_NV_stream_remote EGL_NV_stream_reset EGL_NV_stream_socket EGL_NV_stream_socket_inet EGL_NV_stream_socket_unix EGL_NV_stream_sync EGL_NV_stream_fifo_next EGL_NV_stream_fifo_synch

Note: I have also had reports of this exact code failing on RTX 2070 cards.

Edit: After upgrading to 460.27.04 the error has changed to EGL_BAD_ALLOC, which implies there is not enough free memory; however, nvidia-smi reports I am only consuming 1061/4041 MiB.
Edit 2: The open-source nouveau driver performs the DMA transfer just fine with this code, indicating this is certainly a driver bug and not a hardware limitation.

Upon further examination the implementation is simply missing, so why do nvidia drivers advertise support for EGL_EXT_image_dma_buf_import?

nvidia/nv-dma.c:

NV_STATUS NV_API_CALL nv_dma_import_dma_buf
(
    nv_dma_device_t *dma_dev,
    struct dma_buf *dma_buf,
    NvU32 *size,
    void **user_pages,
    nv_dma_buf_t **import_priv
)
{
    return NV_ERR_NOT_SUPPORTED;
}

NV_STATUS NV_API_CALL nv_dma_import_from_fd
(
    nv_dma_device_t *dma_dev,
    NvS32 fd,
    NvU32 *size,
    void **user_pages,
    nv_dma_buf_t **import_priv
)
{
    return NV_ERR_NOT_SUPPORTED;
}

void NV_API_CALL nv_dma_release_dma_buf
(
    void *user_pages,
    nv_dma_buf_t *import_priv
)
{
}

After NVidia's announcement and the news of DMA-BUF support coming in the 470 drivers, this still does not work: attempting to create an EGLImage from a dma-buf fd still results in EGL_BAD_ALLOC.

One of our community members managed to elicit a response from an NVidia staffer on this issue; the reply is below. I am not sure why this thread itself was not responded to, as it was directly referenced in the query and the staffer clearly read it, as evidenced in the reply.

The below message was forwarded to me on the 9th of July 2021.

Hi REDACTED,

Thanks for reaching out. EGL_EXT_image_dma_buf_import is supported by the 470 driver, and in fact several versions before that as well. Looking at the code snippet on the forum post, where is the dmabuf coming from in that case? Like how was it allocated?

As you pointed out, Xwayland relies on that extension to import buffers from client applications, but in that case the actual dmabuf is allocated internally by the client-side GL / Vulkan driver. I don’t believe we currently expose any mechanism for applications to allocate a dmabuf themselves.

Also, the code in nv-dma.c isn’t really relevant. It’s actually nvidia-drm-gem-nvkms-memory.c that handles this path in concert with the secret code in libnvidia-glsi.so

Thanks,
REDACTED

A reply containing the code that allocates the DMA-BUF was sent the same day, and we have yet to receive a response.

Based on the wording here, and the GPL issues associated with DMA-BUF, unless this is actually a driver bug I speculate that NVidia is implementing something that looks like a DMA-BUF but is really internal to the NVidia driver itself and does not actually conform to the DMA-BUF specification:

https://01.org/linuxgraphics/gfx-docs/drm/driver-api/dma-buf.html

The three main components of this are: (1) dma-buf, representing a sg_table and exposed to userspace as a file descriptor to allow passing between devices, (2) fence, which provides a mechanism to signal when one device has finished access, and (3) reservation, which manages the shared or exclusive fence(s) associated with the buffer.

If this is the case, then the NVidia drivers do not conform to the specification for EGL_EXT_image_dma_buf_import and advertising such support is invalid.

eglExportDMABUFImageMESA(), which would allow a dma-buf of a texture object to cross process boundaries, also no longer seems to work (did it ever work? there are some posts that suggest it did).

I am fully in agreement that this should be pulled from the list of supported extensions.