Any example on real-time video processing?

I am looking for an introductory example of how to use CUDA + OpenGL to do real-time video processing. It seems that the CUDA SDK only contains a couple of image processing examples. Any input is appreciated.

Thanks

I am trying to read AVI files. If there is any AVI decoder example, that would also be great.

Look at the “CUDA Video Decoder GL API” sample in the SDK. It decodes the video frames with the cuvid API decoder, after which you can map the decoded frame to device memory or host memory and perform some post-processing on it. That said, if you absolutely need to read AVI files, you’re out of luck as far as the cuvid decoder goes; at the moment, I believe it only handles MPEG-1/2 (and maybe H.264).
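If it helps, the post-processing part of that sample boils down to something like this (a sketch from memory; hDecoder and nPicIdx come from the parser callbacks the sample sets up, and the exact pointer types vary a bit between SDK versions):

// map the decoded frame (NV12 data in device memory), process it, unmap it
CUdeviceptr devFrame = 0;
unsigned int pitch = 0;
CUVIDPROCPARAMS vpp = {0};
vpp.progressive_frame = 1;

cuvidMapVideoFrame(hDecoder, nPicIdx, &devFrame, &pitch, &vpp);
// ... launch your post-processing kernel on devFrame here ...
cuvidUnmapVideoFrame(hDecoder, devFrame);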

Thank you for your reply.

I looked at that example. It seems to be based on the driver API instead of the runtime API (maybe because decoding is faster when it is done entirely on the hardware itself?). Anyway, I am a complete newbie and I only know the runtime API.

I double-checked my AVI video, and it turns out it is uncompressed! This is what I did:

I read the AVI with the VFW library (frame by frame) under Windows, display/process the current frame with CUDA (texture memory) + OpenGL, and then render the frame with OpenGL. The code works, but I really want to optimize its performance since I will be working with very high-resolution video (1920×1280). Here is part of my display callback code run under glutMainLoop(); I have some performance questions about it.

// I have a simple class called AVIFile which reads the AVI file and
// returns the current frame via the member function ReturnCurrentFrame()
class AVIFile
{
public:
    uchar4 *ReturnCurrentFrame();
    // ... plus GetWidth(), GetHeight(), Go2NextFrame(), etc.
};

// texture reference (at file scope)
texture<uchar4, 2> texFrame;

// In main(): create the AVIFile object and allocate the device array frameDev
AVIFile *videoAVI = new ...;
cudaArray *frameDev;
cudaChannelFormatDesc uchar4Desc = cudaCreateChannelDesc<uchar4>();
cudaMallocArray(&frameDev, &uchar4Desc, width, height);

/////////////////////////////////////
// display callback
/////////////////////////////////////
void displayFunc(void)
{
    // pointer to the mapped PBO, which the kernel writes its output into
    unsigned int *outputFrame;
    size_t num_bytes;

    // copy the new frame to device memory (size = width * height * sizeof(uchar4))
    cudaMemcpyToArray(frameDev, 0, 0, videoAVI->ReturnCurrentFrame(), size,
                      cudaMemcpyHostToDevice);

    // map the PBO resource (registered once at startup) and get a device pointer
    cudaGraphicsMapResources(1, &cuda_pbo_resource, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&outputFrame, &num_bytes,
                                         cuda_pbo_resource);

    // bind the device array to the texture reference
    cudaBindTextureToArray(texFrame, frameDev);

    // CUDA filtering: the kernel reads texFrame and writes outputFrame
    Filter(outputFrame);

    // unbind texture
    cudaUnbindTexture(texFrame);

    // unmap the PBO so OpenGL can read from it
    cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0);

    // OpenGL rendering: update the texture from the bound PBO, then draw
    // one oversized triangle that covers the whole viewport
    glClear(GL_COLOR_BUFFER_BIT);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, videoAVI->GetWidth(),
                    videoAVI->GetHeight(), GL_RGBA, GL_UNSIGNED_BYTE,
                    BUFFER_DATA(0));
    glBegin(GL_TRIANGLES);
        glTexCoord2f(0, 0); glVertex2f(-1, -1);
        glTexCoord2f(2, 0); glVertex2f(+3, -1);
        glTexCoord2f(0, 2); glVertex2f(-1, +3);
    glEnd();
    glFinish();   // wait for GL to finish
    glutSwapBuffers();

    // advance to the next frame
    videoAVI->Go2NextFrame();
}
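For context, the PBO setup happens once in main() before glutMainLoop(); roughly like this (a from-memory sketch, so the exact flags may differ). The point is that cudaGraphicsGLRegisterBuffer is called only once, and the per-frame callback only maps/unmaps:

// one-time OpenGL + CUDA interop setup (sketch)
GLuint pbo, tex;

glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, NULL, GL_STREAM_DRAW);

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

// register the PBO with CUDA once; only map/unmap per frame in displayFunc()
cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, pbo,
                             cudaGraphicsMapFlagsWriteDiscard);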

How do I further optimize the performance? Since I haven’t seen any video processing example that uses the runtime API, I am not sure my code flow is correct. For instance, is it necessary to do all of the following in each display callback:

GetCurrentFrame → Copy to device memory → Map resources → Bind texture → CUDA Processing → Unbind texture → Unmap resources → OpenGL Rendering → …

Is this the correct order? Are there any steps I can bypass or simplify?

Regarding the individual steps, I have two more questions:

  1. In each callback, I need to read the current frame on the CPU with the VFW library and then memcpy it to device memory. This might be time-consuming given the 1920×1280 frame size (see the pinned-memory sketch after these questions).

  2. I have been using the OpenGL rendering code above for a while, but I am not an OpenGL expert either. Can someone tell me whether that rendering part is correct?
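On question 1: one idea I have been considering is pinned (page-locked) host memory, so the host-to-device copy runs at full PCIe speed and can be made asynchronous. A minimal sketch of what I mean (untested; the names match my code above):

// allocate the host staging buffer once with cudaHostAlloc instead of new/malloc
uchar4 *hostFrame;
size_t frameBytes = width * height * sizeof(uchar4);
cudaHostAlloc((void **)&hostFrame, frameBytes, cudaHostAllocDefault);

cudaStream_t stream;
cudaStreamCreate(&stream);

// per frame: decode into hostFrame, then copy asynchronously
memcpy(hostFrame, videoAVI->ReturnCurrentFrame(), frameBytes);
cudaMemcpyToArrayAsync(frameDev, 0, 0, hostFrame, frameBytes,
                       cudaMemcpyHostToDevice, stream);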

Thank you so much for reading.

I’m not sure if there are any examples of doing video processing with the runtime API. This is a consequence of the cuvid library using the driver API; ordinarily you’d want to use cuvid if you were decoding frames, but since yours are already decompressed, I guess it’s not necessary. I know very little about OpenGL (trying to learn at the moment), so I’m not sure I’d be much help. From what I can see of your code, it looks fine. What kind of FPS are you getting?

Thanks. The FPS is not bad. When playing the 1920-pixel-wide video, I noticed that the frame freezes after a while and then jumps right back to the beginning. Say the video has 100 frames: after playing 80 of them, the screen freezes at frame 80 and then, after a little while, jumps back to frame 1. What might be the reason for this behavior? The video I am playing is over 2 GB. Is it possible that memory is running out?
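To test the memory theory, I am going to print the free device memory every frame; a quick sketch:

// query free/total device memory each frame to see if something leaks
size_t freeMem = 0, totalMem = 0;
cudaMemGetInfo(&freeMem, &totalMem);
printf("free: %lu MB of %lu MB\n",
       (unsigned long)(freeMem >> 20), (unsigned long)(totalMem >> 20));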

Hard to say offhand why that’s happening; are you buffering the frames, and if so, is it a bounded buffer? When you play the video, do you loop it, or do you only play it once?

I once posted a sample where I combined the BoxFilter SDK sample with a real-time video grabber library. I think the .zip download is broken now, but I could dig it out again if needed.

While blurring a webcam image may be of limited use, it is one example of real-time video processing.

I don’t know too much about OpenGL. What does “buffering the frames” mean?

I think my main concern is still the workflow in each display callback:

GetCurrentFrame → Copy to device memory → Map resources → Bind texture → CUDA Processing → Unbind texture → Unmap resources → OpenGL Rendering → …

Can I further optimize this workflow? In each callback I need to read the current frame on the CPU with the VFW library and then memcpy it to device memory, which might be time-consuming given the 1920×1280 frame size.

I found two related links discussing similar performance issues.

Hello, thank you for your reply. I am very interested in your code, especially your workflow.

As I wrote above, my main concern is whether the workflow in my display callback (GetCurrentFrame → copy to device memory → map resources → bind texture → CUDA processing → unbind texture → unmap resources → OpenGL rendering) is the best one, and whether the per-frame CPU read and host-to-device copy can be avoided at 1920×1280.

I found something weird. I understand that the first run of the CUDA code is slow: on the full-resolution video (1920×1280), the first run only achieves 20 FPS, which is fine. The second run can reach 60 FPS, but after a little while it drops back to 20 FPS and never recovers. I think it might be due to something else; maybe memory is running out?

I think I found the problem (though not the solution yet). I am reading the AVI file with the Windows function AVIStreamGetFrame(). I think this function has memory issues: it becomes incredibly slow after reading, say, 20 frames.
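For reference, this is roughly the VFW read path I use (from memory), in case someone can spot what is leaking. As far as I know, one AVIStreamGetFrameClose() is supposed to balance each AVIStreamGetFrameOpen(), and the DIB pointer that AVIStreamGetFrame() returns is owned by the PGETFRAME object, so the caller must not free it:

#include <windows.h>
#include <vfw.h>   // link with vfw32.lib

AVIFileInit();

PAVIFILE file = NULL;
PAVISTREAM stream = NULL;
AVIFileOpen(&file, "video.avi", OF_READ, NULL);
AVIFileGetStream(file, &stream, streamtypeVIDEO, 0);

PGETFRAME frameObj = AVIStreamGetFrameOpen(stream, NULL);

for (LONG i = AVIStreamStart(stream); i < AVIStreamEnd(stream); ++i) {
    // returns a packed DIB: a BITMAPINFOHEADER followed by the pixel data
    LPBITMAPINFOHEADER dib = (LPBITMAPINFOHEADER)AVIStreamGetFrame(frameObj, i);
    // ... convert/copy the pixels and upload to CUDA ...
}

AVIStreamGetFrameClose(frameObj);
AVIStreamRelease(stream);
AVIFileRelease(file);
AVIFileExit();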

Hi,

A good solution that worked for me was to use Microsoft DirectShow filters. You can start with a transform filter into which you incorporate your CUDA functions. Instead of transferring each frame back to the CPU, you can use the Direct3D or OpenGL interoperability; in that case the output of the transform filter just has to be connected to a NULL renderer. Alternatively, you can work directly in a renderer filter, which is a solution I haven’t tried yet.

The main advantage of this approach is that you can use almost any video source (compressed or not), because filters for decompression already exist (ffdshow, for instance).
However, working with DirectShow can sometimes be painful. But once your filter is created, you can build any kind of application easily.
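To make the idea concrete, here is a rough skeleton of such a transform filter (a sketch only: it assumes the DirectShow base classes from the Windows SDK, and CLSID_CudaFilter / RunCudaFilter are placeholder names for your own CLSID and CUDA entry point):

#include <streams.h>   // DirectShow base classes (CTransformFilter etc.)

class CCudaFilter : public CTransformFilter
{
public:
    CCudaFilter(LPUNKNOWN pUnk, HRESULT *phr)
        : CTransformFilter(NAME("CUDA Filter"), pUnk, CLSID_CudaFilter) {}

    // accept only uncompressed RGB32 video (simplified check)
    HRESULT CheckInputType(const CMediaType *mtIn)
    {
        return (*mtIn->Type() == MEDIATYPE_Video &&
                *mtIn->Subtype() == MEDIASUBTYPE_RGB32)
                   ? S_OK : VFW_E_TYPE_NOT_ACCEPTED;
    }

    HRESULT CheckTransform(const CMediaType *mtIn, const CMediaType *mtOut)
    {
        return CheckInputType(mtIn);
    }

    // GetMediaType / DecideBufferSize omitted here; they just copy the
    // input format and negotiate buffer sizes

    // called once per frame by the filter graph
    HRESULT Transform(IMediaSample *pIn, IMediaSample *pOut)
    {
        BYTE *pSrc, *pDst;
        pIn->GetPointer(&pSrc);
        pOut->GetPointer(&pDst);
        // upload pSrc, run the CUDA kernel, write the result to pDst
        // (or keep it on the GPU and hand it to the D3D/OpenGL interop)
        RunCudaFilter(pSrc, pDst, pIn->GetActualDataLength());
        return S_OK;
    }
};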

Hope that helps.