Any example on real-time video processing?

I am looking for an introductory example of how to use CUDA + OpenGL to do real-time video processing. It seems that the CUDA SDK only contains a couple of image processing examples. Any input is appreciated.

Thanks

I am trying to read AVI files. If there is any AVI decoder example, that would also be great.

Look at the “CUDA Video Decoder GL API” sample in the SDK. It decodes the video frames with the cuvid API decoder, after which you can map the decoded frame to device memory or host memory and perform some post-processing on it. That said, if you absolutely need to read AVI files, you’re out of luck as far as the cuvid decoder goes; at the moment, I believe it only handles MPEG-1/2 (and maybe H.264).
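If it helps, the post-processing part of that sample boils down to something like this (a sketch from memory; hDecoder and nPicIdx come from the parser callbacks the sample sets up, and the exact pointer types vary a bit between SDK versions):

// map the decoded frame (NV12 data in device memory), process it, unmap it
CUdeviceptr devFrame = 0;
unsigned int pitch = 0;
CUVIDPROCPARAMS vpp = {0};
vpp.progressive_frame = 1;

cuvidMapVideoFrame(hDecoder, nPicIdx, &devFrame, &pitch, &vpp);
// ... launch your post-processing kernel on devFrame here ...
cuvidUnmapVideoFrame(hDecoder, devFrame);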

Thank you for your reply.

I looked at that example. It seems to be based on the driver API instead of the runtime API (maybe because decoding is faster when it is done entirely on the hardware itself?). Anyway, I am a complete newbie and I only know the runtime API.

I double-checked my AVI video, and it turns out it is uncompressed! This is what I did:

I read the AVI with the VFW library (frame by frame) under Windows, display/process the current frame with CUDA (texture memory) + OpenGL, and then render the frame with OpenGL. The code works, but I really want to optimize its performance since I will be working with very high-resolution video (1920×1280). Here is part of my display callback code run under glutMainLoop(); I have some performance questions about it.

// I have a simple class called AVIFile which reads the AVI file and
// returns the current frame via the member function ReturnCurrentFrame()
class AVIFile
{
public:
    uchar4 *ReturnCurrentFrame();
    // ... plus GetWidth(), GetHeight(), Go2NextFrame(), etc.
};

// texture reference (at file scope)
texture<uchar4, 2> texFrame;

// In main(): create the AVIFile object and allocate the device array frameDev
AVIFile *videoAVI = new ...;
cudaArray *frameDev;
cudaChannelFormatDesc uchar4Desc = cudaCreateChannelDesc<uchar4>();
cudaMallocArray(&frameDev, &uchar4Desc, width, height);

/////////////////////////////////////
// display callback
/////////////////////////////////////
void displayFunc(void)
{
    // pointer to the mapped PBO, which the kernel writes its output into
    unsigned int *outputFrame;
    size_t num_bytes;

    // copy the new frame to device memory (size = width * height * sizeof(uchar4))
    cudaMemcpyToArray(frameDev, 0, 0, videoAVI->ReturnCurrentFrame(), size,
                      cudaMemcpyHostToDevice);

    // map the PBO resource (registered once at startup) and get a device pointer
    cudaGraphicsMapResources(1, &cuda_pbo_resource, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&outputFrame, &num_bytes,
                                         cuda_pbo_resource);

    // bind the device array to the texture reference
    cudaBindTextureToArray(texFrame, frameDev);

    // CUDA filtering: the kernel reads texFrame and writes outputFrame
    Filter(outputFrame);

    // unbind texture
    cudaUnbindTexture(texFrame);

    // unmap the PBO so OpenGL can read from it
    cudaGraphicsUnmapResources(1, &cuda_pbo_resource, 0);

    // OpenGL rendering: update the texture from the bound PBO, then draw
    // one oversized triangle that covers the whole viewport
    glClear(GL_COLOR_BUFFER_BIT);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, videoAVI->GetWidth(),
                    videoAVI->GetHeight(), GL_RGBA, GL_UNSIGNED_BYTE,
                    BUFFER_DATA(0));
    glBegin(GL_TRIANGLES);
        glTexCoord2f(0, 0); glVertex2f(-1, -1);
        glTexCoord2f(2, 0); glVertex2f(+3, -1);
        glTexCoord2f(0, 2); glVertex2f(-1, +3);
    glEnd();
    glFinish();   // wait for GL to finish
    glutSwapBuffers();

    // advance to the next frame
    videoAVI->Go2NextFrame();
}
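For context, the PBO setup happens once in main() before glutMainLoop(); roughly like this (a from-memory sketch, so the exact flags may differ). The point is that cudaGraphicsGLRegisterBuffer is called only once, and the per-frame callback only maps/unmaps:

// one-time OpenGL + CUDA interop setup (sketch)
GLuint pbo, tex;

glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, width * height * 4, NULL, GL_STREAM_DRAW);

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
             GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

// register the PBO with CUDA once; only map/unmap per frame in displayFunc()
cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, pbo,
                             cudaGraphicsMapFlagsWriteDiscard);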

How do I further optimize the performance? Since I haven’t seen any video processing example that uses the runtime API, I am not sure my code flow is correct. For instance, is it necessary to do all of the following in each display callback:

GetCurrentFrame → Copy to device memory → Map resources → Bind texture → CUDA Processing → Unbind texture → Unmap resources → OpenGL Rendering → …

Is this the correct order? Are there any steps I can bypass or simplify?

Regarding the individual steps, I have two more questions:

  1. In each callback, I need to read the current frame on the CPU with the VFW library and then memcpy it to device memory. This might be time-consuming given the 1920×1280 frame size (see the pinned-memory sketch after these questions).

  2. I have been using the OpenGL rendering code above for a while, but I am not an OpenGL expert either. Can someone tell me whether that rendering part is correct?
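On question 1: one idea I have been considering is pinned (page-locked) host memory, so the host-to-device copy runs at full PCIe speed and can be made asynchronous. A minimal sketch of what I mean (untested; the names match my code above):

// allocate the host staging buffer once with cudaHostAlloc instead of new/malloc
uchar4 *hostFrame;
size_t frameBytes = width * height * sizeof(uchar4);
cudaHostAlloc((void **)&hostFrame, frameBytes, cudaHostAllocDefault);

cudaStream_t stream;
cudaStreamCreate(&stream);

// per frame: decode into hostFrame, then copy asynchronously
memcpy(hostFrame, videoAVI->ReturnCurrentFrame(), frameBytes);
cudaMemcpyToArrayAsync(frameDev, 0, 0, hostFrame, frameBytes,
                       cudaMemcpyHostToDevice, stream);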

Thank you so much for reading.

I’m not sure if there are any examples of doing video processing with the runtime API. This is a consequence of the cuvid library using the driver API; ordinarily you’d want to use cuvid if you were decoding frames, but since yours are already decompressed, I guess it’s not necessary. I know very little about OpenGL (trying to learn at the moment), so I’m not sure I’d be much help. From what I can see of your code, it looks fine. What kind of FPS are you getting?

Thanks. The FPS is not bad. When playing the 1920-pixel-wide video, I noticed that the frame freezes after a while and then jumps right back to the beginning. Say the video has 100 frames: after playing 80 of them, the screen freezes at frame 80 and then, after a little while, jumps back to frame 1. What might be the reason for this behavior? The video I am playing is over 2 GB. Is it possible that memory is running out?
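To test the memory theory, I am going to print the free device memory every frame; a quick sketch:

// query free/total device memory each frame to see if something leaks
size_t freeMem = 0, totalMem = 0;
cudaMemGetInfo(&freeMem, &totalMem);
printf("free: %lu MB of %lu MB\n",
       (unsigned long)(freeMem >> 20), (unsigned long)(totalMem >> 20));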

Hard to say offhand why that’s happening; are you buffering the frames, and if so, is it a bounded buffer? When you play the video, do you loop it, or do you only play it once?

I once posted a sample where I combined the BoxFilter SDK sample with a real-time video grabber library. I think the .zip download is broken now, but I could dig it out again if needed.

While blurring a webcam image may be of limited use, it is one example of real-time video processing.

I don’t know too much about OpenGL. What does “buffering the frames” mean?

I think my main concern is still the workflow in each display callback:

GetCurrentFrame → Copy to device memory → Map resources → Bind texture → CUDA Processing → Unbind texture → Unmap resources → OpenGL Rendering → …

Can I further optimize this workflow? In each callback I need to read the current frame on the CPU with the VFW library and then memcpy it to device memory, which might be time-consuming given the 1920×1280 frame size.

I found two related links discussing similar performance issues.

Hello, thank you for your reply. I am very interested in your code, especially your workflow.

As I wrote above, my main concern is whether the workflow in my display callback (GetCurrentFrame → copy to device memory → map resources → bind texture → CUDA processing → unbind texture → unmap resources → OpenGL rendering) is the best one, and whether the per-frame CPU read and host-to-device copy can be avoided at 1920×1280.

I found something weird. I understand that the first run of the CUDA code is slow: on the full-resolution video (1920×1280), the first run only achieves 20 FPS, which is fine. The second run can reach 60 FPS, but after a little while it drops back to 20 FPS and never recovers. I think it might be due to something else; maybe memory is running out?

I think I found the problem (though not the solution yet). I am reading the AVI file with the Windows function AVIStreamGetFrame(). I think this function has memory issues: it becomes incredibly slow after reading, say, 20 frames.
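For reference, this is roughly the VFW read path I use (from memory), in case someone can spot what is leaking. As far as I know, one AVIStreamGetFrameClose() is supposed to balance each AVIStreamGetFrameOpen(), and the DIB pointer that AVIStreamGetFrame() returns is owned by the PGETFRAME object, so the caller must not free it:

#include <windows.h>
#include <vfw.h>   // link with vfw32.lib

AVIFileInit();

PAVIFILE file = NULL;
PAVISTREAM stream = NULL;
AVIFileOpen(&file, "video.avi", OF_READ, NULL);
AVIFileGetStream(file, &stream, streamtypeVIDEO, 0);

PGETFRAME frameObj = AVIStreamGetFrameOpen(stream, NULL);

for (LONG i = AVIStreamStart(stream); i < AVIStreamEnd(stream); ++i) {
    // returns a packed DIB: a BITMAPINFOHEADER followed by the pixel data
    LPBITMAPINFOHEADER dib = (LPBITMAPINFOHEADER)AVIStreamGetFrame(frameObj, i);
    // ... convert/copy the pixels and upload to CUDA ...
}

AVIStreamGetFrameClose(frameObj);
AVIStreamRelease(stream);
AVIFileRelease(file);
AVIFileExit();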

Hi,

A good solution that worked for me was to use Microsoft DirectShow filters. You can start with a transform filter into which you incorporate your CUDA functions. Instead of transferring each frame back to the CPU, you can use the Direct3D or OpenGL interoperability; in that case the output of the transform filter just has to be connected to a NULL renderer. Alternatively, you can work directly in a renderer filter, which is a solution I haven’t tried yet.

The main advantage of this approach is that you can use almost any video source (compressed or not), because filters for decompression already exist (ffdshow, for instance).
However, working with DirectShow can sometimes be painful. But once your filter is created, you can build any kind of application easily.
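To make the idea concrete, here is a rough skeleton of such a transform filter (a sketch only: it assumes the DirectShow base classes from the Windows SDK, and CLSID_CudaFilter / RunCudaFilter are placeholder names for your own CLSID and CUDA entry point):

#include <streams.h>   // DirectShow base classes (CTransformFilter etc.)

class CCudaFilter : public CTransformFilter
{
public:
    CCudaFilter(LPUNKNOWN pUnk, HRESULT *phr)
        : CTransformFilter(NAME("CUDA Filter"), pUnk, CLSID_CudaFilter) {}

    // accept only uncompressed RGB32 video (simplified check)
    HRESULT CheckInputType(const CMediaType *mtIn)
    {
        return (*mtIn->Type() == MEDIATYPE_Video &&
                *mtIn->Subtype() == MEDIASUBTYPE_RGB32)
                   ? S_OK : VFW_E_TYPE_NOT_ACCEPTED;
    }

    HRESULT CheckTransform(const CMediaType *mtIn, const CMediaType *mtOut)
    {
        return CheckInputType(mtIn);
    }

    // GetMediaType / DecideBufferSize omitted here; they just copy the
    // input format and negotiate buffer sizes

    // called once per frame by the filter graph
    HRESULT Transform(IMediaSample *pIn, IMediaSample *pOut)
    {
        BYTE *pSrc, *pDst;
        pIn->GetPointer(&pSrc);
        pOut->GetPointer(&pDst);
        // upload pSrc, run the CUDA kernel, write the result to pDst
        // (or keep it on the GPU and hand it to the D3D/OpenGL interop)
        RunCudaFilter(pSrc, pDst, pIn->GetActualDataLength());
        return S_OK;
    }
};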

Hope that helps.