Attempting to use the dvpMemcpy* functions from the DVPAPI, we’re getting an unknown error when using DVP_HALF_FLOAT or other larger storage formats as the buffer’s pixel storage type. We’re using a Quadro 4000, and the copy works as expected with DVP_UNSIGNED_BYTE but not with the larger pixel storage.
- Initialize the GL context
- Obtain the required semaphore data
- Create texture(s) in OpenGL
- Allocate and lock pinnable memory containing the pixel data
- Create semaphore memory instances for the cpu and gpu (based on the required semaphore data)
- dvpCreateGPUTextureGL to create the DVP handle for the texture(s)
- dvpCreateBuffer to create the system memory handle
- dvpBindToGLCtx to bind the system memory handle to the GL context
- Pass required args to dvpMemcpyLined within a dvpBegin/dvpEnd block
No matter what we’ve tried, the result of the dvpMemcpy* functions is always DVP_STATUS_ERROR when the storage format is anything but DVP_UNSIGNED_BYTE. Any tips or ideas would be great! All other calls to the DVP API return OK; it’s just the copy that fails. Perhaps that means we’re missing a step.
We’ve used this pixel data with PBOs for a long while now and are hoping to reduce the copy time on supporting devices, or at least offload the allocating/copying to a thread that can think ahead of the GPU render.
Hi there @mccartneyworks and welcome to the NVIDIA developer forums.
May I ask through which SDK you are using DVP and which version of the DVP API you are using?
That might help narrow down what might be the issue here.
I am not too familiar with this area, but my understanding is that DVP was replaced by or incorporated with GPU Direct for Video and DVP API support might be limited with our latest drivers/firmware.
But I will check if I can find some experts to chime in.
Thanks for your reply!
We’re basing our code off the Blackmagic SDK examples and using the headers/DVP libs from their package. In their OpenGL examples, which seem to operate well on the given card, only DVP_UNSIGNED_BYTE is ever used.
As for the use of DVP vs. something more modern, we are happy to use the latest and greatest means of texture throwing on modern hardware. If we can get something that’s supported outside of just the Quadro series and into the GTX 10 or RTX 20/30 series, that would be excellent too!
The DVP version provided with the Blackmagic SDK is
We’ve been looking for a means to use CUDA to do the pixel copy on pinned memory too and convert/utilize opengl textures asynchronously/in parallel. If this is the modern approach, we can pivot to this. Any example CUDA<->OpenGL code would be a treat.
Is the ultimate goal to capture from a Blackmagic DeckLink video I/O device into CUDA rather than OpenGL? If so, I have authored a capture-to-CUDA example that is not yet included in the DeckLink SDK, and I can share it with you. Also, the current GPU Direct for Video (DVP) SDK is v2.10, which I can share as well. Once you have the latest SDK and the CUDA sample code, if this is the desired path, we can debug the capture of DVP_HALF_FLOAT.
We’re looking to extend the capabilities of our review/NLE platform, whose renderer is written in OpenGL (and a little secret sauce). We deal with film assets in stereo, and creatives want to compare against multiple versions (~16K of raw pixels being shuttled per frame, yikes!). These images are pre-rendered, not captured.
We want the most efficient way on Quadro and/or RTX cards to go from system memory to the GPU for use in an OpenGL render pipeline, whatever that might be. If it’s just ordinary PBOs from OpenGL then we’ll keep using that! We will be looking into GPUD for nvme/nic transfers in the near future as well, but not quite yet.
This all came about as we’ve been building plugins for various presentation APIs on our platform. Blackmagic has the best documentation/examples we’ve seen. We want to render to SDI output for full-frame stereo support. Eventually the goal may be to make the target graphics library agnostic, but we’re stuck with OpenGL for the time being.
Effective Current Pipeline:
File Data -> System Memory -> OpenGL PBOs -> Textures --> Preview Render
`-> DeckLink Render
If the OpenGL PBO implementation doesn’t require an additional copy, then perhaps this current endeavor is moot, but our goal was to see the performance characteristics of something like:
... Pinned Memory -> DVP/CUDA Shuttle to OpenGL Textures --> Preview Render
`-> DeckLink Render
We’d also like to use this the other way around, packing frames to render out the resulting review timeline. Any information, or a “no, no, that’s crazy, don’t do it like that,” would be great.
As for the upgraded DVP SDK we would be happy to test it and get back to you with more advanced findings and results.
Apologies for the slow follow up on this.
Yea, for OpenGL, ping-pong PBOs are still the best way to use the copy engines on the GPU and upload textures (or copy them back). Under the hood, in the case of OpenGL, GPU Direct for Video uses PBOs.
I will share the latest DVP SDK later this week. I apologize, I have been out of the office.
No worries on the timing. Thanks again - this is all great data!
Glad to know we were on the right path before. I’ll keep mucking with the engine and see what kind of results I can get with the newer DVP SDK once available. If for no other reason than to set ourselves up for using GPU Direct more effectively once at that stage.
Was wondering if I could see that capture-to-CUDA example? I’d be interested to see whether it utilizes RDMA or if it’s just a system buffer DMA’d into GPU memory.
While we’re still mucking about with it, we have gathered some possible components. I can’t show any of the specific code we’ve written, but for research we’ve been looking into the Blackmagic SDK, more specifically their LoopThroughWithOpenGLCompositing example, among others.
As for the CUDA elements, we haven’t found any means of directly accessing the pixel data from the capture, so we assume it’s a shuttle to system memory. In the BM SDK examples, they show off using pinned memory to avoid the additional copy. Ideally, using CUDA for that would just be a matter of changing IDeckLinkInputCallback::VideoInputFrameArrived to move the pixels into your CUDA context; something super basic like this is where we started, but instead of manually making the pixel data, we just used IDeckLinkVideoInputFrame::GetBytes(), among other parts of the DeckLink SDK.
If you’re looking to use the DVP API for CUDA (e.g. dvpCreateGPUCUDAArray() and other like functions), we’re assuming it would be a matter of swapping out the specific commands in something like the VideoFrameTransfer class from the BM SDK’s LoopThroughWithOpenGLCompositing example, plus a little CUDA-specific magic.
Perhaps this is all known information but I hope some of it helps!
Please accept my apologies.
I need to verify the sample code against the latest BMD SDK download.
Then I need to determine how best to share it.
Please give me a couple of days.
Was wondering if there were any updates on this? This seems like it would be immensely helpful for my live video processing framework. If it works the way I’m thinking, then I could just get a CUDA pointer to the frame instead of doing a memcpy.
Hey was wondering if you had gotten anywhere with this? No rush or anything just curious
Hello. I would also be interested in the DVP->CUDA example. Thank you!
Me too! Why don’t they have a video->CUDA buffer direct example?