Digital Video Pipeline (dvp) error on dvpMemcpy*

While attempting to use the dvpMemcpy* functions from the DVP API, we’re getting an unknown error when using DVP_HALF_FLOAT or other larger storage formats as the DVPBufferTypes value.

We’re using a Quadro 4000 and the copy works as expected with DVP_UNSIGNED_BYTE but not with the larger pixel storage.

Our steps:

  • Initialize the GL context with dvpInitGLContext
  • Obtain the required semaphore data with dvpGetRequiredConstantsGLCtx
  • Create texture(s) in OpenGL
  • Allocate and lock pinnable memory containing the pixel data
  • Create semaphore memory instances for the cpu and gpu (based on dvpGetRequiredConstantsGLCtx data)
  • Use dvpCreateGPUTextureGL to create the DVP handle for the texture(s)
  • Use dvpCreateBuffer to create the system memory handle
  • Use dvpBindToGLCtx to bind the system memory handle to the GL context
  • Pass required args to dvpMemcpyLined within a dvpBegin/dvpEnd block, wrapping it with dvpMapBuffer(Wait|End)DVP.

No matter what we’ve tried, the result of the dvpMemcpy* functions is always DVP_STATUS_ERROR when the storage format is anything but DVP_UNSIGNED_BYTE. Any tips or ideas would be great! All other function calls to the DVP API return OK; it’s just the copy that fails. Perhaps that means we’re missing a step.
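One thing that may be worth checking when moving from 8-bit to wider components is whether the row stride and total size handed to the system memory buffer scale with the component size. The following is only a sketch of that arithmetic, not a confirmed cause of the error. The enum and function names are hypothetical stand-ins (they are not the DVP header's definitions), and the stride alignment is assumed to be whatever value the setup step reported.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the component types discussed in this thread.
// The real enumerators live in the DVP header and are not redefined here.
enum class ComponentType { UnsignedByte, HalfFloat, Float };

// Bytes used by a single component of the given type.
inline std::size_t bytesPerComponent(ComponentType t) {
    switch (t) {
        case ComponentType::UnsignedByte: return 1;
        case ComponentType::HalfFloat:    return 2;
        case ComponentType::Float:        return 4;
    }
    return 0;  // not reached for valid enumerators
}

// Round value up to the next multiple of alignment.
inline std::size_t alignUp(std::size_t value, std::size_t alignment) {
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

struct BufferLayout {
    std::size_t stride;  // bytes from the start of one row to the next
    std::size_t size;    // total bytes needed for the whole image
};

// Compute the padded stride and total size for an image, given the
// stride alignment required by the transfer API (an assumed input).
inline BufferLayout computeLayout(std::size_t width, std::size_t height,
                                  std::size_t channels, ComponentType type,
                                  std::size_t strideAlignment) {
    const std::size_t rowBytes = width * channels * bytesPerComponent(type);
    const std::size_t stride = alignUp(rowBytes, strideAlignment);
    return {stride, stride * height};
}
```

For a 1920x1080 four-channel image with an assumed 4096-byte stride alignment, this gives a stride of 8192 bytes for 8-bit components and 16384 bytes for half-float components, so an allocation sized for the 8-bit case would be half of what the half-float case needs.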

We’ve used this pixel data with PBOs for a long while now and are hoping to reduce the copy time on supporting devices, or at least offload the allocating/copying to a thread that can think ahead of the GPU render.

Hi there @mccartneyworks and welcome to the NVIDIA developer forums.

May I ask through which SDK you are using DVP and which version of the DVP API you are using?

That might help narrow down what might be the issue here.

I am not too familiar with this area, but my understanding is that DVP was replaced by or incorporated with GPU Direct for Video and DVP API support might be limited with our latest drivers/firmware.

But I will check if I can find some experts to chime in.

Thanks!

Thanks for your reply!

We’re basing our code off the Blackmagic SDK examples and using the headers/dvp libs from their package. Their OpenGL examples, which seem to operate well on the given card, only ever use the DVP_UNSIGNED_BYTE format.

As for the use of DVP vs. something more modern, we are happy to use the latest and greatest means of texture throwing on modern hardware. If we can get something that’s supported outside of just the Quadro series and into the RTX 10* 20* or 30* series that would be excellent too!

The DVP version provided with the Blackmagic SDK is 1.70

We’ve been looking for a means to use CUDA to do the pixel copy on pinned memory too and convert/utilize opengl textures asynchronously/in parallel. If this is the modern approach, we can pivot to this. Any example CUDA<->OpenGL code would be a treat.

Cheers

Hi.

Is the ultimate goal to capture from a Blackmagic DeckLink video I/O device into CUDA rather than OpenGL? If so, there is a capture to CUDA example that I have authored that is not yet included in the DeckLink SDK, and I can share it with you. Also, the current GPU Direct for Video (DVP) SDK is v2.10; I can share this with you as well. Once you have the latest SDK and the CUDA sample code, if this is the desired path, we can debug the capture of DVP_HALF_FLOAT.

-tom

Excellent info!

We’re looking to extend the capabilities of our review/NLE platform whose renderer is written in OpenGL (and a little secret sauce). We deal with film assets in stereo and creatives want to compare against multiple versions (~16K of raw pixels being shuttled per frame, yikes!). These images are pre-rendered, not captured.

We want the most efficient way on Quadro and/or RTX cards to go from system memory to the GPU for use in an OpenGL render pipeline, whatever that might be. If it’s just ordinary PBOs from OpenGL then we’ll keep using that! We will be looking into GPUD for nvme/nic transfers in the near future as well, but not quite yet.

This all came about as we’ve been building plugins for various presentation APIs on our platform. Blackmagic has the best documentation/examples we’ve seen. We want to render to SDI output for full frame stereo support. Eventually the goal may be to agnosticize the target graphics library but we’re stuck with OpenGL for the time being.

Effective Current Pipeline:

File Data -> System Memory -> OpenGL PBOs -> Textures --> Preview Render
                                                      `-> DeckLink Render

If the OpenGL PBO implementation isn’t requiring an additional copy then perhaps this current endeavor is moot, but our goal was to see the performance characteristics of something like:

... Pinned Memory -> DVP/CUDA Shuttle to OpenGL Textures --> Preview Render
                                                         `-> DeckLink Render

Not to mention use this the other way around as well for packing to render out the resulting review timeline. Any information or “no no that’s crazy, don’t do it like that” would be great.

As for the upgraded DVP SDK we would be happy to test it and get back to you with more advanced findings and results.

Cheers

Apologies for the slow follow up on this.

Yea, for OpenGL, ping-pong PBOs are still the best way to use the copy engines on the GPU and upload textures to the GPU (or copy them back). Under the hood, in the case of OpenGL, GPU Direct for Video is using PBOs.
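For readers unfamiliar with the ping-pong pattern mentioned here, below is a minimal sketch of just the slot bookkeeping, written without any OpenGL dependency so it can be compiled on its own. The class name PboRing and its methods are hypothetical. The OpenGL calls named in the comments (glBindBuffer, glMapBufferRange, glUnmapBuffer, glTexSubImage2D) are where the real work would happen in an actual renderer.

```cpp
#include <cassert>
#include <cstddef>

// Tracks which of N pixel buffer slots the CPU fills on the current frame
// and which slot (filled on the previous frame) feeds the texture upload.
template <std::size_t N = 2>
class PboRing {
    static_assert(N >= 2, "need at least two slots to overlap work");
public:
    // Slot the CPU should write new pixel data into this frame.
    std::size_t writeSlot() const { return frame_ % N; }

    // True once a previous frame has been written and can be uploaded.
    bool hasUpload() const { return frame_ > 0; }

    // Slot whose contents should be copied into the texture this frame.
    std::size_t uploadSlot() const {
        assert(hasUpload());
        return (frame_ - 1) % N;
    }

    // Call once per frame after both steps have been issued.
    void advance() { ++frame_; }

private:
    std::size_t frame_ = 0;
};

// Per-frame usage in a real renderer (GL calls shown as comments only):
//   if (ring.hasUpload()) {
//       // glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[ring.uploadSlot()]);
//       // glTexSubImage2D(..., nullptr);  // source is the bound PBO
//   }
//   // glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo[ring.writeSlot()]);
//   // glMapBufferRange(...), copy the pixels, glUnmapBuffer(...)
//   ring.advance();
```

The point of the rotation is that the texture upload reads from the buffer filled one frame earlier while the CPU fills the other one, so the two steps never touch the same buffer in the same frame.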

I will share the latest DVP SDK later this week. I apologize, I have been out of the office.

No worries on the timing. Thanks again - this is all great data!

Glad to know we were on the right path before. I’ll keep mucking with the engine and see what kind of results I can get with the newer DVP SDK once available. If for no other reason than to set ourselves up for using GPU Direct more effectively once at that stage.

Cheers

Hi,

Was wondering if I could see that capture to CUDA example? Would be interested to see if that utilizes RDMA or if it’s just a sys buffer into a DMA into gmem.

While we’re still mucking about with it, we have gathered some possible components. I can’t show any of the specific code we’ve written, but for research we’ve been looking into the Blackmagic SDK, more specifically their LoopThroughWithOpenGLCompositing or DX11 examples.

As for the CUDA elements, we haven’t found any means of directly accessing the pixel data from the capture, so we assume it’s a shuttle to the system mem. In the BM SDK examples, they show off using pinned memory to avoid the additional copy. Ideally using CUDA for that would just be a matter of changing the IDeckLinkInputCallback::VideoInputFrameArrived to move the pixels into your CUDA context. Something super basic like this is where we started, but instead of manually making the pixel data, we just used IDeckLinkVideoInputFrame::GetBytes(), among other parts of the DeckLink SDK.
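As a rough illustration of the host-side step described above (this is not the poster's code, and it uses neither the DeckLink SDK nor CUDA), here is a sketch of copying a frame whose rows are padded to a pitch into a buffer with a different pitch. The function names copyRows and packTight are hypothetical. In a DeckLink callback the source pointer and pitch would presumably come from GetBytes() and GetRowBytes(), and on the CUDA side cudaMemcpy2DAsync takes the same pitch/width/height shape of arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy `height` rows of `rowPayload` bytes each from a source with row
// pitch `srcPitch` into a destination with row pitch `dstPitch`.
inline void copyRows(const std::uint8_t* src, std::size_t srcPitch,
                     std::uint8_t* dst, std::size_t dstPitch,
                     std::size_t rowPayload, std::size_t height) {
    assert(rowPayload <= srcPitch && rowPayload <= dstPitch);
    for (std::size_t y = 0; y < height; ++y) {
        std::memcpy(dst + y * dstPitch, src + y * srcPitch, rowPayload);
    }
}

// Convenience wrapper: return a tightly packed copy of a padded frame.
inline std::vector<std::uint8_t> packTight(const std::uint8_t* src,
                                           std::size_t srcPitch,
                                           std::size_t rowPayload,
                                           std::size_t height) {
    std::vector<std::uint8_t> out(rowPayload * height);
    copyRows(src, srcPitch, out.data(), rowPayload, rowPayload, height);
    return out;
}
```

If the destination were pinned memory registered with the GPU API in use, the same loop would fill it directly, which is the copy the pinned-memory approach is meant to make the only one on the host.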

If you’re looking to use the DVP API for CUDA (e.g. dvpCreateGPUCUDAArray() and other like functions), we’re assuming it would be a matter of swapping out the specific commands for something like the VideoFrameTransfer class in the BM SDK LoopThroughWithOpenGLCompositing example and a little CUDA-specific magic.

Perhaps this is all known information but I hope some of it helps!

Cheers

Please accept my apologies.

I need to verify the sample code against the latest BMD SDK download.

Then I need to determine how best to share it.

Please give me a couple of days.

Was wondering if there were any updates on this? This seems like it would be immensely helpful for my live video processing framework. If it works the way I’m thinking, then I could just get a CUDA pointer to the frame instead of doing a memcpy.

Hey was wondering if you had gotten anywhere with this? No rush or anything just curious