Real-Time Image Processing with CUDA

Hi all

I really need some help and advice, as I’m new to CUDA coding and image processing.

I am trying to implement an algorithm for a system in which the camera captures 1000 fps. I need to read the value of each pixel in every image and run various calculations on the evolution of pixel[i][j] over N images, for all the pixels in the images. I have the image data as (unsigned char *ptr) and want to transfer it to the GPU and start implementing the algorithm, but I am not sure what the best option for real-time processing would be.
My system:
CPU: Intel Xeon X5660 2.8 GHz (2 processors)
GPU: NVIDIA Quadro 5000

Can you please give me some ideas on the following questions:

  1. Do I need to add an image processing library in addition to CUDA? If yes, what do you suggest?

  2. As I am new to CUDA programming: can I create a matrix for pixel[i,j] that holds that pixel's values across images [1:n], for every pixel position in the image? For example, for 1000 images of size 200x200 I would end up with 40,000 matrices, each
    containing the 1000 values of one pixel. Does CUDA give me something like OpenCV's matrix or vector types?

Please, if you have any ideas or recommendations, let me know.
I really need some expert advice.
Thank you
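
On question 2, here is a rough sketch, not tied to any particular library, of how the N frames could sit in one flat buffer instead of 40,000 separate matrices. The names stack_index and temporal_mean are made up for illustration, and the mean is only a stand-in for whatever per-pixel calculation is actually needed. It is plain host-side C++, assuming 8-bit grayscale frames stored one after another.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: offset of pixel (row, col) of frame t in a flat buffer
// that stores frame 0 first, then frame 1, and so on (row-major inside a frame).
inline std::size_t stack_index(std::size_t t, std::size_t row, std::size_t col,
                               std::size_t height, std::size_t width) {
    return (t * height + row) * width + col;
}

// Stand-in per-pixel calculation: mean of every pixel over n_frames frames.
// Returns one float per pixel position, in row-major order.
std::vector<float> temporal_mean(const std::vector<std::uint8_t>& stack,
                                 std::size_t n_frames,
                                 std::size_t height, std::size_t width) {
    std::vector<float> mean(height * width, 0.0f);
    if (n_frames == 0) {
        return mean;  // nothing to average
    }
    // In a CUDA kernel these two outer loops would be replaced by the
    // thread index: one thread per pixel position.
    for (std::size_t row = 0; row < height; ++row) {
        for (std::size_t col = 0; col < width; ++col) {
            float sum = 0.0f;
            // Walk the time series of this single pixel.
            for (std::size_t t = 0; t < n_frames; ++t) {
                sum += static_cast<float>(stack[stack_index(t, row, col, height, width)]);
            }
            mean[row * width + col] = sum / static_cast<float>(n_frames);
        }
    }
    return mean;
}
```

With this frame-by-frame layout, threads that handle neighbouring pixels read neighbouring addresses at each time step, which is the access pattern the GPU tends to handle well. For 1000 frames of 200x200 the whole stack is 40,000,000 bytes, so it should fit in device memory as a single allocation.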

ArrayFire is a CUDA library that offers both image processing functions and easy matrix manipulation and subscripting, so it sounds like a good fit. Links for your questions are below:

Image Processing:

Manipulating matrices and subscripting:

Although your system is among the best, for a real-time implementation of your algorithm you have to consider the limited memory transfer bandwidth from host to device and back.

To get the highest transfer throughput, you’d better use textures and buffer objects in OpenGL to “copy next frames”, “process previous frames” and “depict the results” simultaneously (of course, it’s not the only way). Afterwards, based on your algorithm (do not forget to design a parallel version of your algorithm in advance!), you can use faster memory such as shared memory as a scratchpad to compute whatever you want.

If you need more details let me know (sa d~o~t dehghani a~t gmail d~o~t com).

You can try putting the image into a 2D texture (to be more exact, a cudaArray that is bound to a texture). That gives you cached read access and, when required, 2D bilinear interpolation, which allows for supersampling calculations.

I seriously doubt that the data rate resulting from a 1000 fps image capture can be transferred to the GPU in real time, because of PCI-Express bandwidth limitations. Can you give us the expected data rate that you get from the camera?
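
To put a number on that question, here is a back-of-the-envelope helper. The function name is made up, and the 200x200 frame size with 1 byte per pixel is only the example from the original post, not a confirmed camera spec.

```cpp
#include <cstdint>

// Hypothetical helper: raw bytes per second produced by the camera,
// before any compression or protocol overhead.
constexpr std::uint64_t camera_bytes_per_second(std::uint64_t width,
                                                std::uint64_t height,
                                                std::uint64_t bytes_per_pixel,
                                                std::uint64_t fps) {
    return width * height * bytes_per_pixel * fps;
}

// Example from the question: 200x200, 8-bit, 1000 fps -> 40,000,000 bytes/s.
static_assert(camera_bytes_per_second(200, 200, 1, 1000) == 40000000ULL,
              "200x200 8-bit at 1000 fps is 40 MB/s");

// A larger, purely illustrative sensor: 1024x1024, 8-bit, 1000 fps.
static_assert(camera_bytes_per_second(1024, 1024, 1, 1000) == 1048576000ULL,
              "1024x1024 8-bit at 1000 fps is about 1.05 GB/s");
```

At 200x200 that is about 40 MB/s, which is small next to the roughly 8 GB/s theoretical peak per direction of a PCI-Express 2.0 x16 slot (assuming that is what the card sits in; real throughput is lower). At megapixel frame sizes the rate passes 1 GB/s and the transfer starts to matter, so the answer really depends on the actual frame size and bit depth.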

There are options for DMA to the card, such as direct transfer from a video capture card to a GPU, which might be supported by a Quadro. But that is one of the enterprise solutions that I do not have much information about.


Quadro cards with the “SDI I/O” option support SD/HD broadcast standards (25, 30, 50 and 60 fps, not something like 1000 fps).

The best way to use DMA is based on “Buffer Objects”!



hi all,

the link to the Image processing library was quite helpful & informative, thanks for that,
by the way, how do you approach a video file, say AVI? I had a look around the forum:
some suggest using the Video for Windows library, some suggest using OpenCV. I feel comfortable
leaving all the dirty work to OpenCV, but are there any throughput trade-offs?



For the decoding of video, you can use the decoder API (which is part of CUDA, as opposed to having to use any external libraries). You also have the added advantage of having the decoded frames already in GPU memory, whereas if you use a CPU-based method for decoding you’ll have to transfer the video frames to the GPU after decoding them to perform GPU-based image processing.