Real-time 1080p video processing


I am using CUDA to process a full 1080p HD video stream. What is the best strategy for this kind of work? Right now I am doing this:

while( frame ) {
  1. read frame i on the CPU using the Windows API (either the old VFW or the newer Media Foundation)
  2. copy frame i to GPU memory with cudaMemcpy
  3. process frame i with CUDA
  4. copy the output back to the CPU
  5. render the output frame with OpenGL
}
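As a sketch, the loop above looks roughly like this in CUDA host code. The kernel body and the I/O helpers (`read_next_frame`, `display_with_opengl`) are hypothetical placeholders standing in for the Windows reader and the OpenGL renderer; error checking is omitted for brevity:

```cuda
#include <cuda_runtime.h>

extern bool read_next_frame(unsigned char* dst);            // step 1: VFW / Media Foundation (assumed)
extern void display_with_opengl(const unsigned char* src);  // step 5 (assumed)

__global__ void process_frame(const unsigned char* in, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];                        // placeholder: invert pixels
}

void run_pipeline(int width, int height)
{
    const int bytes = width * height * 3;                   // one 24-bit RGB frame
    unsigned char *h_in, *h_out, *d_in, *d_out;
    cudaMallocHost(&h_in, bytes);                           // pinned host memory makes
    cudaMallocHost(&h_out, bytes);                          // the memcpys faster
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    int threads = 256, blocks = (bytes + threads - 1) / threads;
    while (read_next_frame(h_in)) {                                  // step 1
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);       // step 2
        process_frame<<<blocks, threads>>>(d_in, d_out, bytes);      // step 3
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);     // step 4
        display_with_opengl(h_out);                                  // step 5
    }

    cudaFree(d_in); cudaFree(d_out);
    cudaFreeHost(h_in); cudaFreeHost(h_out);
}
```

One easy win in this structure: allocating the host frame buffers with cudaMallocHost (pinned memory) instead of malloc noticeably speeds up steps 2 and 4.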

However, I found that step 1 takes a tremendous amount of time compared to the remaining steps. For an uncompressed video of 2 GB with 500 frames, loading each frame on the CPU takes around 100 ms. Does anyone have similar experience?

I’ve never done it with an uncompressed video format; frames that large are going to be slow to read no matter what. If you use a compressed format, reading each frame into memory is faster, and you can use the CUVID decoder for the decoding. An added benefit of CUVID is that you can do the processing and display without ever copying the frame data back to the host: beyond reading the compressed frame initially, everything stays on the GPU.
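For reference, the NVCUVID decode path is callback-driven. This is only an outline (assuming `nvcuvid.h`; the parameter structs have many fields that are omitted here, so treat it as a map of the API rather than working code):

```cuda
#include <nvcuvid.h>

// 1. Create a parser with cuvidCreateVideoParser(), supplying callbacks
//    for sequence start, picture decode, and picture display.
// 2. Feed compressed packets from the file/stream:
//        cuvidParseVideoData(parser, &packet);
// 3. In the decode callback, submit the picture to the hardware decoder:
//        cuvidDecodePicture(decoder, picParams);
// 4. In the display callback, map the decoded frame into CUDA memory:
//        CUdeviceptr devPtr; unsigned int pitch;
//        cuvidMapVideoFrame(decoder, picIdx, &devPtr, &pitch, &procParams);
//        /* run your CUDA kernels directly on devPtr (NV12 layout) */
//        cuvidUnmapVideoFrame(decoder, devPtr);
// The frame never has to leave the GPU before display.
```

The CUDA SDK ships a decoder sample that fills in these structs; starting from that sample is much easier than writing the setup from scratch.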

Hello, is it possible to see the code you wrote?

I’m working with HD (1920x1080) imagery, and you have to understand the I/O bounds to see the problem (and the solution) here. 1920x1080 pixels x 3 bytes/pixel (1 byte per R, G, and B channel) x 30 frames per second ≈ 180 MB/sec. No single (non-SSD) hard drive can sustain this rate (much less Gigabit Ethernet). To reach the realistic rates that a CameraLink (up to 6 Gb/sec) or HD-SDI (1.5 Gb/sec) input can supply, you need to pull the whole video stream into RAM, then move the frames over to the graphics card one at a time to emulate the ‘real’ video stream rate. Solid-state drives have better I/O rates, and you may be able to emulate an HD source directly if you read the data from one of those instead.

That being said, once I pull the data into host (CPU) memory, I can emulate ‘live’ camera data at 200+ frames/sec.

Hope this answers the question you’re asking. I don’t know whether you’ll have the RAM needed to load all 500 frames into memory (2 MP x 3 bytes/pixel x 500 frames ≈ 3 GB, which requires a 64-bit OS for that much RAM in a single process).

BTW, why bring it back out to the CPU? You can render in OpenGL directly from the card with the CUDA interop. I’m doing simple stuff, so I didn’t need to rewrite my code to accommodate the interop, but if you’re modifying existing OpenGL code to use CUDA for part of it, I can understand that it may be difficult.
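The interop path mentioned here has the kernel write its output into an OpenGL pixel buffer object instead of copying it back to the host. A minimal sketch, assuming a PBO `pbo` has already been created with glGenBuffers/glBufferData and that `process_frame` is the same hypothetical kernel as before:

```cuda
#include <cuda_gl_interop.h>

// Register the PBO with CUDA once, after creating it in OpenGL.
cudaGraphicsResource* res;
cudaGraphicsGLRegisterBuffer(&res, pbo, cudaGraphicsMapFlagsWriteDiscard);

// Per frame: map the PBO, let the kernel write into it, unmap, then draw.
unsigned char* d_out;
size_t size;
cudaGraphicsMapResources(1, &res, 0);
cudaGraphicsResourceGetMappedPointer((void**)&d_out, &size, res);
process_frame<<<blocks, threads>>>(d_in, d_out, n);
cudaGraphicsUnmapResources(1, &res, 0);

// Now bind the PBO and upload it to a texture with glTexSubImage2D;
// the processed frame never crosses back over the PCIe bus.
```

This replaces steps 4 and 5 of the original loop with a draw straight from GPU memory.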

I am new to the CUDA stuff, but what about the new GPUDirect? Would that solve some of the problems with the transfer from CameraLink to the GPU?

I would also be very interested in an answer to that question (what is the fastest way to transfer data from a CameraLink card to the GPU). Is there any way to transfer data directly between the CameraLink card and the GPU without going through host memory?