Processing Windows GDI drawing with CUDA: finding a way to do so at high speed

Hello everyone.

I have read a lot of topics about screenshot image processing with CUDA and also reviewed some SDK examples. All of them use pixel buffer or vertex buffer objects, taking advantage of CUDA’s interoperability with these OpenGL objects.

I’m currently working on a project where I need to take a screenshot of the entire Windows desktop and do some processing on it (that’s where CUDA comes in) without rendering anything. So my first idea was to create an OpenGL context associated with the Windows desktop using the WGL functions, then create a pixel buffer object, read the desktop contents into this buffer with glReadPixels, map the PBO into CUDA and so on. Using a PBO should give several benefits in running time, at least that’s what I have read.

But there is no way I can create an OpenGL rendering context associated with the Windows desktop that supports the buffer extensions. The docs say this is because Microsoft’s generic implementation of OpenGL corresponds to version 1.1.0, where buffer objects did not exist yet. I have tried to make it work with the OpenGL implementation of my NVIDIA video card (which corresponds to version 2.1.2), but that only works with a window other than the desktop. The OpenGL forums don’t say anything about this, but it seems that a window that supports GDI drawing (and the desktop is certainly one of those) will always get the generic implementation rather than the one provided by the graphics driver.

Does anybody have a strategy in mind for how to get Windows GDI drawing into CUDA using OpenGL at a reasonable speed?

In a 1024 × 768 × 32 bpp configuration that means moving 1024 × 768 × 4 bytes = 3 MB of data per frame, so copying it first to system RAM and then back into CUDA memory space would add a lot of latency.

Is there an answer in Direct3D? (I have never worked with it before.)

Thanks in advance!

I’ll describe the solution I’ve found.

  • Create a fully transparent window with the dimensions of the desktop, using the extended styles WS_EX_LAYERED and WS_EX_TOPMOST and the style WS_POPUP. The window should cover the entire screen, whose dimensions can be retrieved with API functions such as “EnumDisplaySettings”. The window class must not specify a background color, icon, cursor or menu (set these values to NULL).
  • Get the window’s device context, set an adapter-native single-buffered pixel format (it can’t be a generic one!), create a rendering context from the device context, and make it current.
  • Check the current version of OpenGL: glGetString(GL_VERSION) must return 2.1.x or above. If it returns 1.1.0, then we got a generic pixel format and can’t use the buffer extensions.
  • Init GLEW, then create and initialize a pixel buffer object with the size of the transparent window (see the sketch after this list).
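
For reference, here is a rough sketch of those four steps, assuming Win32, GLEW and a C++ compiler; the class name “DesktopCapture”, the function name and the (mostly omitted) error handling are mine, not a working listing from the project.

#include <cstdio>
#include <windows.h>
#include <GL/glew.h>

GLuint g_pbo = 0;

bool CreateCaptureWindowAndPBO(int width, int height)
{
    // Step 1: window class with no background brush, icon, cursor or menu.
    WNDCLASS wc = {};
    wc.lpfnWndProc   = DefWindowProc;
    wc.hInstance     = GetModuleHandle(NULL);
    wc.lpszClassName = TEXT("DesktopCapture");
    RegisterClass(&wc);

    // Fully transparent, topmost popup window covering the whole screen.
    HWND hwnd = CreateWindowEx(WS_EX_LAYERED | WS_EX_TOPMOST,
                               TEXT("DesktopCapture"), TEXT(""), WS_POPUP,
                               0, 0, width, height, NULL, NULL, wc.hInstance, NULL);
    SetLayeredWindowAttributes(hwnd, 0, 0, LWA_ALPHA);  // alpha = 0 -> fully transparent
    ShowWindow(hwnd, SW_SHOW);

    // Step 2: accelerated, single-buffered RGBA pixel format + rendering context.
    HDC hdc = GetDC(hwnd);
    PIXELFORMATDESCRIPTOR pfd = {};
    pfd.nSize      = sizeof(pfd);
    pfd.nVersion   = 1;
    pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL;  // no PFD_DOUBLEBUFFER
    pfd.iPixelType = PFD_TYPE_RGBA;
    pfd.cColorBits = 32;
    SetPixelFormat(hdc, ChoosePixelFormat(hdc, &pfd), &pfd);
    HGLRC hglrc = wglCreateContext(hdc);
    wglMakeCurrent(hdc, hglrc);

    // Step 3: if this prints 1.1.0 we ended up on the generic implementation.
    printf("GL_VERSION: %s\n", (const char*)glGetString(GL_VERSION));

    // Step 4: init GLEW and create one PBO large enough for a 32-bpp desktop shot.
    if (glewInit() != GLEW_OK)
        return false;
    glGenBuffers(1, &g_pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, g_pbo);
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL, GL_STREAM_READ);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    return true;
}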

Now any time we need a desktop shot, we just do “glBindBuffer(GL_PIXEL_PACK_BUFFER, PBO)” and “glReadPixels(…, NULL)” with the window’s rendering context current, as in the sketch below.
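
In code that looks roughly like this; the GL_BGRA/GL_UNSIGNED_BYTE arguments are my guess for a 32-bpp desktop (the post above elides them), and the PBO, width and height are the values from the setup sketch:

// One desktop shot straight into the PBO; with a single-buffered pixel format
// the read buffer defaults to GL_FRONT, i.e. what is currently on screen.
void GrabDesktop(GLuint pbo, int width, int height)
{
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, NULL);  // NULL -> pack into the bound PBO
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
}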

To process it with CUDA, register and map the pixel buffer object.
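
Concretely, with the CUDA runtime’s graphics-interop API that would look something like the sketch below (older toolkits exposed cudaGLRegisterBufferObject/cudaGLMapBufferObject instead; processFrame is a hypothetical kernel of mine):

#include <GL/glew.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

__global__ void processFrame(const uchar4* pixels, int n)
{
    // ... whatever post-processing you need ...
}

void RunCudaOnPBO(GLuint pbo, int width, int height)
{
    // In a real loop you would register the PBO once and only map/unmap per
    // frame; everything is shown together here for brevity.
    cudaGraphicsResource* res = NULL;
    cudaGraphicsGLRegisterBuffer(&res, pbo, cudaGraphicsRegisterFlagsReadOnly);

    cudaGraphicsMapResources(1, &res, 0);
    uchar4* devPtr = NULL;
    size_t  size   = 0;
    cudaGraphicsResourceGetMappedPointer((void**)&devPtr, &size, res);

    int n = width * height;
    processFrame<<<(n + 255) / 256, 256>>>(devPtr, n);

    cudaGraphicsUnmapResources(1, &res, 0);
    cudaGraphicsUnregisterResource(res);
}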

There is a drawback to this approach: I can only get an image of the main plane. If an application is using the overlay plane, it appears as a black rectangle, which is expected since in the pixel format we must specify which plane OpenGL is going to work with. The other thing is that I don’t know whether it works with full-screen apps.

That’s how I can currently post-process the Windows desktop.

There is no way to catch the overlay plane; it is only merged in on the way out to the screen, so Windows never sees it.

Also, are you sure the data is transferred directly between OpenGL and CUDA via the PBO? I saw a spike in CPU usage during the transfer (a very big CPU overhead for some reason), and some people have reported a delay that seems to indicate the data is copied back to the CPU before being transferred to CUDA.

Would an FBO work here as well, and what are the advantages/disadvantages of that?

I agree with you.

I think so. The buffer object specification says OpenGL may allocate the PBO’s memory inside the video card. If that’s the case, the transfer means a copy inside the video card, from the OpenGL framebuffer to the PBO memory. Mapping the PBO with CUDA means “making the PBO’s contents accessible from the CUDA context”.

If the PBO memory is allocated in client-side memory (regular RAM), then a delay must exist, because the data would have to travel from the video card to RAM over the PCI-E bus. In my experiments I obtained the following results:

with a resolution of 1024 × 768 × 32 bpp: more than 160 frames per second transferred from the Windows desktop to the CUDA context (no kernel is launched, but the PBO mapping is done);

with 1280 × 1024 × 32 bpp: more than 140 FPS.

In both cases I read the entire desktop content.

I do see an increase in CPU usage, but I think that’s normal since I do nothing but read after read (including the frames-per-second calculation) in an infinite loop until I close the app.
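
For what it’s worth, the loop I’m describing is essentially this (a sketch built on the GrabDesktop/RunCudaOnPBO sketches from the earlier posts, with std::chrono standing in for whatever timer the real code uses):

#include <chrono>
#include <cstdio>

void BenchmarkLoop(GLuint pbo, int width, int height)
{
    using clock = std::chrono::steady_clock;
    auto last  = clock::now();
    int frames = 0;

    for (;;)  // runs until the app is closed
    {
        GrabDesktop(pbo, width, height);    // glReadPixels into the PBO
        RunCudaOnPBO(pbo, width, height);   // map the PBO into CUDA (the FPS figures above were measured without launching a kernel)

        ++frames;
        auto now = clock::now();
        if (now - last >= std::chrono::seconds(1))
        {
            printf("%d FPS\n", frames);
            frames = 0;
            last = now;
        }
    }
}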

From what I have read, OpenGL decides whether to use video RAM or regular RAM to allocate the PBO; we can only give it a “clue” about what we want the PBO for.
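
That “clue” is the usage hint passed to glBufferData when the PBO’s storage is created (as in the setup sketch earlier in the thread); for a readback PBO something like GL_STREAM_READ is the natural choice. This is just how I express the intent, the driver still decides where the memory actually lives:

// The last argument is only a hint: GL_STREAM_READ says GL will write the data
// (glReadPixels) and the application will read it back, so the driver should
// prefer memory the GPU can write quickly. It is free to ignore the hint.
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL, GL_STREAM_READ);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);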

I don’t think so, because the desktop content is rendered by Windows GDI and has nothing to do with OpenGL; they (GDI and OpenGL) only share the same framebuffer in the video card. How could you instruct Windows to draw to an off-screen framebuffer?

Creating a full-screen transparent window means it covers the entire framebuffer, so reading its content is the same as reading the desktop content. Also, being transparent means you can work normally and you don’t see the window, although I have not tried what happens if another always-on-top window is launched.

I will put together a report including source code in a few days.