GPU Direct and 30 video cameras

We need to mosaic and do some image clean-up on thirty video streams (as well as a little object tracking). We are comfortable with RANSAC in CUDA; what worries us is whether this architecture will handle the amount of video we are talking about. Tell us if we are crazy about the proposal we are suggesting to our sponsor below:

  • 30 FPS, 10-bit black & white video streams (30 cameras @ 30 FPS), each sensor about 1 MPixel
  • Total aggregate raw bandwidth works out to about 2.4 GB/s
  • Matrox tells us to use this PCIe card http://www.matrox.com/video/en/products/developer/hardware/dsx_le4_fh (8 SDI inputs each)
  • We would then use GPUdirect to pull all the video into a Tesla K40 12 GB
  • Processing would be image cleanup, registration, and some simple object recognition
  • We only need about 6 frames at a time, and the computation needs to be real-time tracking of objects in the stream, so we have enough onboard RAM
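A quick back-of-the-envelope check of that bandwidth figure (a sketch only; the pixel-storage assumptions are ours, and the 1.3 MP / 16-bit case is a guess that happens to land near the 2.4 GB/s quoted above):

```python
# Rough aggregate-bandwidth check for the proposed capture setup.
# Assumed parameters (from the numbers in this thread): 30 cameras,
# 30 FPS, ~1 MP (or 1.3 MP) monochrome sensors.
cameras = 30
fps = 30

def aggregate_gb_per_s(megapixels, bytes_per_pixel):
    """Aggregate raw bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return cameras * fps * megapixels * 1e6 * bytes_per_pixel / 1e9

# 10-bit pixels packed tightly (1.25 bytes each):
print(aggregate_gb_per_s(1.0, 10 / 8))   # ~1.125 GB/s
# 10/12-bit pixels padded out to 16-bit words:
print(aggregate_gb_per_s(1.0, 2))        # ~1.8 GB/s
# 1.3 MP sensors at 16 bits/pixel (close to the 2.4 GB/s above):
print(aggregate_gb_per_s(1.3, 2))        # ~2.34 GB/s
```

So the aggregate figure swings by a factor of two depending on whether the grabber delivers packed 10-bit data or 16-bit-padded pixels.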

We have a couple of concerns about this (if we cannot use the Tesla K40, we can do an FPGA version):

  • Is the Matrox card the best for this?
  • Can we believe Matrox that, with these 4 cards on the same Root Complex, we can read the full raw data from all 30 cameras?
  • We really want something other than 720p or 1080p images. This is a black & white problem, so 12 bits per pixel and 1.3 MP would be nice
  • Does this lock down all the CUDA memory, or can we do computation while the raw video is being double-buffered into CUDA memory?

i suppose some relationship between the video streams holds…?

i doubt that none of the streams are ‘linked’, but i equally doubt that all of them are…?

Sorry, can’t comment on whether the Matrox board is the best solution for you, as I don’t have any experience with this particular card. The bottleneck will certainly be the x4 or x8 PCIe connection. Are the cameras really SDI cameras, or something like Camera Link?

GPU Direct for Video does not lock down all the CUDA memory, only the device memory bound for the transfers. So, with the dual-copy engines on the K40 computation should run in parallel with video transfers to the GPU.
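As a concrete picture of that overlap, here is a host-side sketch in Python, with two threads standing in for the copy engine and the compute engine (all names are illustrative; a real pipeline would use async DMA transfers and CUDA streams rather than threads):

```python
# Host-side sketch of double buffering: while the "copy engine" fills
# one buffer with the next frame, the "compute engine" processes the
# previously filled one. Names and data are illustrative only.
import threading

NUM_FRAMES = 6
buffers = [None, None]                       # ping-pong buffer pair
ready = [threading.Event(), threading.Event()]
done = [threading.Event(), threading.Event()]
done[0].set()                                # both slots start empty
done[1].set()
results = []

def copy_engine():
    for frame in range(NUM_FRAMES):
        slot = frame % 2
        done[slot].wait()                    # wait until compute freed the slot
        done[slot].clear()
        buffers[slot] = frame                # "DMA" the frame in
        ready[slot].set()                    # hand the slot to compute

def compute_engine():
    for frame in range(NUM_FRAMES):
        slot = frame % 2
        ready[slot].wait()                   # wait for the transfer
        ready[slot].clear()
        results.append(buffers[slot] * 2)    # stand-in for real processing
        done[slot].set()                     # release the slot for reuse

t1 = threading.Thread(target=copy_engine)
t2 = threading.Thread(target=compute_engine)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)                               # [0, 2, 4, 6, 8, 10]
```

The point is that the transfer of frame N+1 and the processing of frame N proceed concurrently, which is exactly what the K40's dual copy engines let you do against the compute engine.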

how many of the video cards do you need - 4? (30 / 8 -> 4)

if so, you wish to slot 5 pci cards in total - the additional gpu

even if you use a server motherboard with more than 4 pci slots, would this not still boil down to max 4 pci slots per cpu, and does gpudirect not require devices to be on the same bus?

VideoGuru: thanks for the tip about GPU Direct for Video. BTW, is there a difference between RDMA and GPU Direct for Video? The BitFlow sales engineer thought so (and they also have a board).

Yes. The videos are “linked” in that they all overlap and look at a large scene. But I suspect we might not have to mosaic them and do the perspective warps and alignments (having the Tesla and the image libraries will help us test our theories, though). What I would like to do is GENLOCK all the cameras, since SDI, Camera Link, etc. do not have any sync ability.

GPU Direct says that it only requires all frame grabbers to be on the same Root Complex as the Tesla.

We have flexibility on the camera interface. We like CameraLink, GigE Vision, USB, or whatever works.

From the comments above it seems the issue might be the number of frame grabbers that we can have together with the Tesla and CPU on the same motherboard. Matrox Imaging has a 4-input device (so we would need 8 PCIe x4 cards). But they say that with this souped-up motherboard with switching, it can handle that number of cards: http://www.matrox.com/imaging/en/products/vision_systems/supersight/

TAKE A LOOK AT THE SPECS. 14 PCIe cards: http://www.matrox.com/imaging/media/pdf/products/supersight/supersight_e2.pdf

wow, that particular pci backplane almost has more pci switches than transistors; and to top it off, it is pcie 2

so, yes, your representative is correct in that the rack can slot 9+ pci cards; but i think you should have him confirm that it would yield your required pci bandwidth, given the way you intend to use the rack - the number of grabbers you wish to slot and channel to a single gpu device

the rack seems like a typical case where they sacrifice bandwidth for number of slots

even if the streams overlap, do you really need all streams’ data to process individual streams; i would think that only the borders would suffice

personally, i would give distributed processing or interleaved processing as much consideration as centralized processing
and perhaps video splitters - generating multiple instances of the same stream - can further yield more design flexibility

if you can amalgamate (some of) the streams beforehand, you may perhaps already be in a better position, as you then might decrease the number of grabbers, which in turn might move you back to a high-end ‘standard’ (server) motherboard with fewer slots/ greater bandwidth

perhaps you can source a high-end video multiplexer for this purpose

equally, more ‘open’ and ‘freely-configurable’ protocols, like ethernet and usb from your list, might have a comparable effect; ethernet ‘stacks’ very easily

ethernet (cameras) can equally broadcast streams to multiple drains i would think

Total bandwidth of all raw streams should be about 1.2 GB/s. And with onboard FPGA processing on the frame grabber we might be able to reduce that if needed. I see people claiming real PCIe gen 3 bandwidth of 6 GB/s, so I am less concerned about the bandwidth.

We do need to track down the following issues:

  • Are there enough BAR addresses for 8 cards?
  • Can I do the RDMA directly to CUDA memory, or do I have to go through OpenGL and then do a context switch? (see below)

I got a private message that made me worry about RDMA and the need to move data out of an OpenGL context. Can anyone verify that this is an issue with CUDA 6 and the Tesla K40?

There are some coding complexities in grabbing the frames into a CUDA buffer: since this cannot be done directly, the buffer must first enter via an OpenGL graphics context and then be transferred to a "CUDA buffer" via OpenGL/CUDA interop calls. This required quite a few lines of graphics-related code.

your bandwidth calculation is conditional on the number of bytes used to store a pixel: you mention 10 bits and 12 bits per pixel, and you indeed base your bandwidth calculation on 10 bits per pixel
if you base it on byte multiples, you will find that your bandwidth is at minimum around 1.8 GB/s

in that, i think, perhaps lies the answer to your second question: how the grabber stores frames, and what would be acceptable to the device/cuda: a pixel as 2 bytes minimum, likely 4 bytes (int or float)
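on the memory side, a quick check (a sketch only, using the numbers from this thread: 30 cameras, 1.3 MP, 6 buffered frames per camera) suggests the "6 frames at a time" claim fits easily in the k40's 12 GB even at 4 bytes per pixel:

```python
# device-memory footprint for the buffered frames; parameters are the
# illustrative numbers from this thread, not measured values
cameras = 30
megapixels = 1.3
frames_buffered = 6

def footprint_gb(bytes_per_pixel):
    """total buffer footprint in GB (1 GB = 1e9 bytes)."""
    return cameras * frames_buffered * megapixels * 1e6 * bytes_per_pixel / 1e9

print(footprint_gb(2))   # ~0.47 GB at 16-bit pixels
print(footprint_gb(4))   # ~0.94 GB at 32-bit int/float pixels
```

so memory capacity should not be the constraint; the pci bandwidth remains the thing to verify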

Hi,
I have a Quadro K5200 GPU card and 2 Quadro capture cards. I need to transfer the captured frame data to the GPU and process it. Where can I download the sample code, or the GPU Direct SDK?
Thanks.