Can CUDA process variable-length arrays to handle variable image sizes?


I have written an application in C++ which takes as input hundreds of images from a fixed-position camera. My software processes the data and removes moving objects such as cars or people to produce a single clean-plate image. This works by counting frequencies of pixel values across multiple images.

Here is a video of what my software does.

Over the past several months I have rewritten it several times and it is now probably as fast as I can get it with serial code. Usually it only takes a few minutes to process the images; however, the time can rise a lot when using hundreds.

The nature of this process is perfect for parallel processing since I could do many pixels at the same time, and it only really requires a very simple use of CUDA to do so.

It also occurred to me that I could use it to make a lossless video codec using the same code.

The question I would like to ask, though, is to do with CUDA and variable-length arrays. My software allows variable image sizes and I process the image data inside for loops with std::vectors. The size of these vectors changes at runtime depending on whether the user inputs 1280 by 720, 1920 by 1080 or 4K images as an image sequence. Any size really, as long as they are all the same size in any particular sequence.

If I were to use CUDA to do this processing, do I have to set strictly defined image dimensions at compile time? As in, the user must use 1920 by 1080? Or write code to re-scale the input images to suit an image size fixed at CUDA compile time? I could do that if it is necessary and re-scale the output image back to the original size when finished. It would still work if I have to do it that way.

I hope my question is clear enough that someone with CUDA experience can answer it, so I know what the limits are and can start in the right direction when writing a bit of CUDA.

None of that should be necessary. CUDA C++ is a programming model that is quite a bit like C++. (And there are other language bindings also, like Fortran, Python, Java, C#, etc.) It is flexible. You can write a CUDA kernel to do image processing that can flexibly and efficiently handle varying image sizes.
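To make the "no compile-time size" point a bit more concrete: a kernel launch takes its grid dimensions as ordinary runtime values, so a common pattern is to round the image size up to a whole number of thread blocks. The sketch below is plain C++ showing only that arithmetic (no CUDA calls); the names `GridSize`, `ceilDiv` and `gridFor`, and the 16 by 16 block size, are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical holder for the number of thread blocks in x and y.
struct GridSize {
    std::size_t x;
    std::size_t y;
};

// Integer division rounded up, so a partial block at the edge is still covered.
std::size_t ceilDiv(std::size_t n, std::size_t d) {
    assert(d > 0);
    return (n + d - 1) / d;
}

// Computes how many blocks are needed to cover a width x height image.
// width and height are runtime values; nothing here is fixed at compile time.
// The 16x16 default block size is an assumption, not a recommendation.
GridSize gridFor(std::size_t width, std::size_t height,
                 std::size_t blockX = 16, std::size_t blockY = 16) {
    return {ceilDiv(width, blockX), ceilDiv(height, blockY)};
}
```

Because the grid can overshoot the image (1080 rows need 68 blocks of 16, covering 1088 rows), a kernel launched this way would normally have each thread compare its coordinates against the width and height and return early when it falls outside the image.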

I should also mention that there are various libraries built on top of CUDA (so-called CUDA-X libraries). One of these is NPP (NVIDIA Performance Primitives), which is largely similar to Intel’s IPP image processing primitives. You might find a library like that useful. And of course those libraries can flexibly handle varying image sizes, without any compile-time constraints.

Ah, that’s good to know.

Thanks for that Robert. I’ll check out those libraries mate.

I am just wondering about the mathematical approach you’re taking. Do you apply a median filter to each pixel, taking the RGB values of N consecutive video frames as an input?

If you’re doing median filtering, then you could apply a sort algorithm such as a bitonic sort per pixel. There should be plenty of prior art with regard to CUDA implementations (both in terms of existing libraries and example CUDA kernel source code).
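For reference, the per-pixel temporal median itself is easy to state in plain C++. The sketch below is a CPU-side illustration only: it uses std::nth_element rather than a bitonic sort, works on one colour channel at a time, and the function name `temporalMedian` is made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Median of one colour channel sampled at the same pixel position in N frames.
// The vector is taken by value because std::nth_element reorders its input.
// For an even number of samples this returns the upper of the two middle values.
std::uint8_t temporalMedian(std::vector<std::uint8_t> samples) {
    assert(!samples.empty());
    auto mid = samples.begin() + static_cast<std::ptrdiff_t>(samples.size() / 2);
    std::nth_element(samples.begin(), mid, samples.end());
    return *mid;
}
```

With samples 10, 200, 12, 11, 13 (one frame where something bright passed through the pixel), the outlier 200 has no effect on the result, which is the point of using a median over the frame history.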

I have a feeling that CUDA should be powerful enough to apply this filtering in real time to a continuous video sequence (with a limited history of video frames).

Can your software also deal with slowly panned video (i.e. is there any motion compensation done)? The pinnacle of achievement would be to use video taken from a moving perspective (e.g. handheld camera) as input and to get a clean plate video out. Imagine making a video of your home town with not a single person, moving animal or vehicle in it, taken at rush hour! This is likely where one would have to venture into the domain of machine learning to get best results with limited computational effort.


Hi Christian,

Originally I did think about median filtering, and then I thought to just do an exact count and take the most frequently occurring value.

I’ll try to explain in plain English rather than code or mathematics.

The camera has to be on a locked tripod and the motion must be across the camera. It doesn’t work if objects are travelling towards the camera.

Consider, say, 100 images, looking only at the RGB value of the first pixel in each of those images and totally ignoring every other pixel in all the images. Literally only consider the very first pixel of each image.

Imagine you check the rgb color of those 100 pixels and 95 of them are orange in color and then 2 are black, 2 are red and 1 is green.

So then I say, ok these pixels are usually orange so I think that is the real rgb color when no object is passing over it. I keep that orange value and assign it to be the color of the first pixel of the new clean image and discard the other color info.

This works because the color of a pixel really only changes if a car or person happens to be passing by at the moment the picture is taken.

Then I simply repeat the same process for the second pixel of every image and the third and … Once they are all done I create the new image and the moving objects are gone.
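In code, the idea described above might look roughly like the following. This is only a minimal sketch, not my actual implementation: it assumes each pixel is packed into one 32-bit value, that each frame is a separate vector of the same length, and it uses a simple hash map for the counting. The names `mostFrequentPixel` and `buildCleanPlate` are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Returns the most frequent value among samples taken at one pixel position,
// one sample per frame. On a tie, the value that reaches the top count first wins.
std::uint32_t mostFrequentPixel(const std::vector<std::uint32_t>& samples) {
    assert(!samples.empty());
    std::unordered_map<std::uint32_t, std::size_t> counts;
    std::uint32_t best = samples.front();
    std::size_t bestCount = 0;
    for (std::uint32_t value : samples) {
        const std::size_t c = ++counts[value];
        if (c > bestCount) {
            bestCount = c;
            best = value;
        }
    }
    return best;
}

// Builds the clean plate by repeating the count for every pixel position.
// The frame size is read at runtime; all frames must have the same length.
std::vector<std::uint32_t> buildCleanPlate(
    const std::vector<std::vector<std::uint32_t>>& frames) {
    assert(!frames.empty());
    const std::size_t pixelCount = frames.front().size();
    std::vector<std::uint32_t> plate(pixelCount);
    std::vector<std::uint32_t> samples(frames.size());
    for (std::size_t p = 0; p < pixelCount; ++p) {
        for (std::size_t f = 0; f < frames.size(); ++f) {
            assert(frames[f].size() == pixelCount);
            samples[f] = frames[f][p];
        }
        plate[p] = mostFrequentPixel(samples);
    }
    return plate;
}
```

Each pixel position is handled independently of every other one, which is why the outer loop is such a natural fit for running many pixels in parallel.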

It does take a few minutes to do, though, because the code is literally doing a frequency count of hundreds of millions of bytes. At first I ran into a problem known as the “Count Distinct Problem”, so I had to figure out a way round that, which made it a little bit tricky to code.

I think with CUDA it would be really good, though, because I could process lots of the pixels in parallel.

I like your idea about the video of a town with no one in it. Would be a good effect for a movie or a hyperlapse with no people.

Even though it’s not the method you’re currently using, I found a 2014 thread that deals with optimizing a temporal median filter (intended for noise removal in video sequences).

Here’s a 2016 research paper that claims realtime performance in temporal median filtering (milliseconds per frame).

Thanks for posting that. It’s very interesting because I can sort of see what they are doing even though I don’t understand CUDA yet. Some of the stuff about striding through the images is very similar. I am not sure what they mean when they say “stack 21 images”. Does it mean they are using 21 different vectors to hold the image data?

What I do in my code is load all the data from every image into one massively long 1-dimensional std::vector of chars, so the vector might hold image data from hundreds of images combined into one. I read 4 bytes for the first pixel of the first image, then do a huge hop to get the 4 bytes for the first pixel of the second image, and so on. It seems to access the data very quickly doing that. Originally I wrote it so all the data from all the images was in a big 3-dimensional vector. While it worked, it was slower and I kept running out of memory: 100 or so images used about 32 GB of RAM. The new way hardly uses any RAM and is much faster.
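To show what I mean by the "huge hop", here is a small sketch of the indexing. It assumes the frames are stored one after another in the flat vector with 4 bytes per pixel; the helper names `pixelByteOffset` and `samplesForPixel` are invented for this example and are not taken from my real code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumption: 4 bytes per pixel, frames stored back to back in one flat vector.
constexpr std::size_t kBytesPerPixel = 4;

// Byte offset of a given pixel in a given frame. Moving to the same pixel in
// the next frame is a hop of pixelsPerFrame * kBytesPerPixel bytes.
std::size_t pixelByteOffset(std::size_t frame, std::size_t pixel,
                            std::size_t pixelsPerFrame) {
    return (frame * pixelsPerFrame + pixel) * kBytesPerPixel;
}

// Collects the 4-byte value of one pixel position from every frame.
// pixelsPerFrame (width * height) is a runtime value.
std::vector<std::uint32_t> samplesForPixel(const std::vector<unsigned char>& data,
                                           std::size_t pixelsPerFrame,
                                           std::size_t pixel) {
    assert(pixelsPerFrame > 0 && pixel < pixelsPerFrame);
    const std::size_t frameBytes = pixelsPerFrame * kBytesPerPixel;
    assert(data.size() % frameBytes == 0);
    const std::size_t frameCount = data.size() / frameBytes;
    std::vector<std::uint32_t> samples(frameCount);
    for (std::size_t f = 0; f < frameCount; ++f) {
        // memcpy avoids alignment and aliasing problems when reading 4 raw bytes.
        std::memcpy(&samples[f],
                    data.data() + pixelByteOffset(f, pixel, pixelsPerFrame),
                    kBytesPerPixel);
    }
    return samples;
}
```

For 1920 by 1080 frames the hop between consecutive samples of the same pixel is 1920 * 1080 * 4 = 8,294,400 bytes, and only one small vector of samples (one entry per frame) is needed at a time.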

I need to think about how to get the data into a good CUDA structure, but at the moment I am learning the basics of CUDA.

What type of things do you do with CUDA?

What they mean by stacking is probably just the concept of layering 21 pictures on top of each other in order to eliminate noise. Think of images taken under extreme low-light conditions.

Professionally I do radio channel modeling and simulation in CUDA; in private I do everything else ;) big-number arithmetic, cryptocurrency mining, fractals, etc.

Ah, nice. So very practical purposes. Once I get to grips with it I am going to try and model some stuff too.