GPU Direct and 30 video cameras

We need to mosaic and do some image clean-up on thirty video streams (as well as a little object tracking). We are comfortable with RANSAC in CUDA; what worries us is whether this architecture will handle the amount of video we are talking about. Tell us if we are crazy about the proposal we are suggesting to our sponsor below:

  • 30 FPS, 10-bit black & white video streams (30 cameras @ 30 FPS), each sensor about 1 MPixel
  • Total aggregate raw bandwidth works out to about 2.4 GB/s
  • Matrox tells us to use this PCIe card http://www.matrox.com/video/en/products/developer/hardware/dsx_le4_fh (8 SDI inputs each)
  • We would then use GPUdirect to pull all the video into a Tesla K40 12 GB
  • Processing would be image cleanup, registration, and some simple object recognition
  • We only need about 6 frames at a time, and the computation needs to be real-time tracking of objects in the stream, so we have enough onboard RAM
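A quick back-of-the-envelope check of that bandwidth figure (a sketch only; the pixel-storage assumptions are ours, and the 1.3 MP / 16-bit case is a guess that happens to land near the 2.4 GB/s quoted above):

```python
# Rough aggregate-bandwidth check for the proposed capture setup.
# Assumed parameters (from the numbers in this thread): 30 cameras,
# 30 FPS, ~1 MP (or 1.3 MP) monochrome sensors.
cameras = 30
fps = 30

def aggregate_gb_per_s(megapixels, bytes_per_pixel):
    """Aggregate raw bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return cameras * fps * megapixels * 1e6 * bytes_per_pixel / 1e9

# 10-bit pixels packed tightly (1.25 bytes each):
print(aggregate_gb_per_s(1.0, 10 / 8))   # ~1.125 GB/s
# 10/12-bit pixels padded out to 16-bit words:
print(aggregate_gb_per_s(1.0, 2))        # ~1.8 GB/s
# 1.3 MP sensors at 16 bits/pixel (close to the 2.4 GB/s above):
print(aggregate_gb_per_s(1.3, 2))        # ~2.34 GB/s
```

So the aggregate figure swings by a factor of two depending on whether the grabber delivers packed 10-bit data or 16-bit-padded pixels.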

We have a couple of concerns about this (if we cannot use the Tesla K40, we can do an FPGA version):

  • Is the Matrox card the best for this?
  • Can we believe Matrox that, with these 4 cards on the same Root Complex, we can read the full raw data from all 30 cameras?
  • We really want something other than 720p or 1080p images. This is a black & white problem, so 12 bits per pixel and 1.3 MP would be nice
  • Does this lock down all the CUDA memory, or can we do computation while the raw video is being double-buffered into CUDA memory?

i suppose some relationship between the video streams holds…?

i doubt that none of the streams are ‘linked’, but i equally doubt that all of them are…?

Sorry, can’t comment on whether the Matrox board is the best solution for you, as I don’t have any experience with this particular card. The bottleneck will certainly be the x4 or x8 PCIe connection. Are the cameras really SDI cameras, or something like Camera Link?

GPU Direct for Video does not lock down all the CUDA memory, only the device memory bound for the transfers. So, with the dual-copy engines on the K40 computation should run in parallel with video transfers to the GPU.
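As a concrete picture of that overlap, here is a host-side sketch in Python, with two threads standing in for the copy engine and the compute engine (all names are illustrative; a real pipeline would use async DMA transfers and CUDA streams rather than threads):

```python
# Host-side sketch of double buffering: while the "copy engine" fills
# one buffer with the next frame, the "compute engine" processes the
# previously filled one. Names and data are illustrative only.
import threading

NUM_FRAMES = 6
buffers = [None, None]                       # ping-pong buffer pair
ready = [threading.Event(), threading.Event()]
done = [threading.Event(), threading.Event()]
done[0].set()                                # both slots start empty
done[1].set()
results = []

def copy_engine():
    for frame in range(NUM_FRAMES):
        slot = frame % 2
        done[slot].wait()                    # wait until compute freed the slot
        done[slot].clear()
        buffers[slot] = frame                # "DMA" the frame in
        ready[slot].set()                    # hand the slot to compute

def compute_engine():
    for frame in range(NUM_FRAMES):
        slot = frame % 2
        ready[slot].wait()                   # wait for the transfer
        ready[slot].clear()
        results.append(buffers[slot] * 2)    # stand-in for real processing
        done[slot].set()                     # release the slot for reuse

t1 = threading.Thread(target=copy_engine)
t2 = threading.Thread(target=compute_engine)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)                               # [0, 2, 4, 6, 8, 10]
```

The point is that the transfer of frame N+1 and the processing of frame N proceed concurrently, which is exactly what the K40's dual copy engines let you do against the compute engine.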

how many of the video cards do you need - 4? (30 / 8 -> 4)

if so, you wish to slot 5 pci cards in total - the additional gpu

even if you use a server motherboard with more than 4 pci slots, would this not still boil down to max 4 pci slots per cpu, and does gpudirect not require devices to be on the same bus?

VideoGuru: thanks for the tip about GPU Direct for Video. BTW, is there a difference between RDMA and GPU Direct for Video? The BitFlow sales engineer thought so (and they also have a board).

Yes. The videos are “linked” in that they all overlap and look at a large scene. But I suspect we might not have to mosaic them and do the perspective warps and alignments (having the Tesla and the image libraries will help us test our theories, though). What I would like to do is GENLOCK all the cameras, since SDI, Camera Link, etc. do not have any sync ability.

GPU Direct says that it only requires all frame grabbers to be on the same Root Complex as the Tesla.

We have flexibility on the camera interface. We like CameraLink, GigE Vision, USB, or whatever works.

From the comments above it seems the issue might be the number of frame grabbers that we can have together with the Tesla and CPU on the same motherboard. Matrox Imaging has a 4-input device (so we would need 8 PCIe x4 cards). But they say that with this souped-up motherboard with switching, it can handle that number of cards: http://www.matrox.com/imaging/en/products/vision_systems/supersight/

TAKE A LOOK AT THE SPECS. 14 PCIe cards: http://www.matrox.com/imaging/media/pdf/products/supersight/supersight_e2.pdf

wow, that particular pci backplane almost has more pci switches than transistors; and to top it off, it is pcie 2

so, yes, your representative is correct in that the rack can slot 9+ pci cards; but i think you should have him confirm that it would yield your required pci bandwidth, given the way you intend to use the rack - the number of grabbers you wish to slot and channel to a single gpu device

the rack seems like a typical case where they sacrifice bandwidth for number of slots

even if the streams overlap, do you really need all streams’ data to process individual streams; i would think that only the borders would suffice

personally, i would give distributed processing or interleaved processing as much consideration as centralized processing
and perhaps video splitters - generating multiple instances of the same stream - can further yield more design flexibility

if you can amalgamate (some of) the streams beforehand, you may perhaps already be in a better position, as you then might decrease the number of grabbers, which in turn might move you back to a high-end ‘standard’ (server) motherboard with fewer slots/ greater bandwidth

perhaps you can source a high-end video multiplexer for this purpose

equally, more ‘open’ and ‘freely-configurable’ protocols, like ethernet and usb from your list, might have a comparable effect; ethernet ‘stacks’ very easily

ethernet (cameras) can equally broadcast streams to multiple drains i would think

Total bandwidth of all raw streams should be about 1.2 GB/s. And with onboard FPGA processing on the frame grabber we might be able to reduce that if needed. I see people claiming real PCIe gen 3 bandwidth of 6 GB/s, so I am less concerned about the bandwidth.

We do need to track down the following issues:

  • Are there enough BAR addresses for 8 cards?
  • Can I do the RDMA directly to CUDA memory, or do I have to go through OpenGL and then do a context switch? (see below)

I got a private message that made me worry about RDMA and the need to move data out of an OpenGL context. Can anyone verify that this is an issue with CUDA 6 and the Tesla K40?

There are some coding complexities in grabbing the frames into a CUDA buffer: since this cannot be done directly, the buffer must first enter via an OpenGL graphics context and then be transferred to a "CUDA buffer" via OpenGL/CUDA interop calls. This required quite a few lines of graphics-related code.

your bandwidth calculation is conditional on the number of bytes used to store a pixel: you mention 10 bits and 12 bits per pixel, and you indeed base your bandwidth calculation on 10 bits per pixel
if you base it on byte multiples, you will find that your bandwidth is at minimum around 1.8 GB/s

in that, i think, perhaps lies the answer to your second question: how the grabber stores frames, and what would be acceptable to the device/cuda: a pixel as 2 bytes minimum, likely 4 bytes (int or float)
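on the memory side, a quick check (a sketch only, using the numbers from this thread: 30 cameras, 1.3 MP, 6 buffered frames per camera) suggests the "6 frames at a time" claim fits easily in the k40's 12 GB even at 4 bytes per pixel:

```python
# device-memory footprint for the buffered frames; parameters are the
# illustrative numbers from this thread, not measured values
cameras = 30
megapixels = 1.3
frames_buffered = 6

def footprint_gb(bytes_per_pixel):
    """total buffer footprint in GB (1 GB = 1e9 bytes)."""
    return cameras * frames_buffered * megapixels * 1e6 * bytes_per_pixel / 1e9

print(footprint_gb(2))   # ~0.47 GB at 16-bit pixels
print(footprint_gb(4))   # ~0.94 GB at 32-bit int/float pixels
```

so memory capacity should not be the constraint; the pci bandwidth remains the thing to verify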

Hi,
I have a Quadro K5200 GPU card and 2 Quadro capture cards. I need to transfer the captured frame data to the GPU and process it. Where can I download the sample code, or the GPU Direct SDK?
Thanks.