How to parallelize serial code

Hello to all!
I am looking for parallel CUDA code that does this:

Given the array a[1,0,0,0,0,3,0,0,5,2,2,3,1,0,0,0,0,0,2]
the output will be the array
a[1,3,5,2,2,3,1,2] (it is the same array without the zeros)

I think this algorithm is inherently serial, but I need a parallel algorithm for it. Is there any ready-made CUDA code for something like this?

Thank you very much in advance!

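Yes. In short: each thread counts the number of zeros in one sub-array (which also tells it how many elements it will keep), then a prefix sum over the kept counts gives the starting output position for each sub-array, and a final pass copies the non-zero data to those positions.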
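The three passes described above can be sketched roughly as follows. This is only a minimal illustration, not tuned code: the names `count_nonzero`, `scatter_nonzero`, `compact` and the chunk size `CHUNK` are made up for this example, and the middle pass (the prefix sum) is done on the host for brevity, whereas a real implementation would do it on the GPU.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int CHUNK = 4;  // elements per thread; hypothetical value chosen for illustration

// Pass 1: each thread counts the non-zero elements in its own sub-array.
__global__ void count_nonzero(const int* in, int n, int* counts, int numChunks) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChunks) return;
    int begin = c * CHUNK, end = min(begin + CHUNK, n);
    int cnt = 0;
    for (int i = begin; i < end; ++i) cnt += (in[i] != 0);
    counts[c] = cnt;
}

// Pass 3: each thread copies its non-zero elements, starting at its output offset.
__global__ void scatter_nonzero(const int* in, int n, const int* offsets,
                                int* out, int numChunks) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChunks) return;
    int begin = c * CHUNK, end = min(begin + CHUNK, n);
    int pos = offsets[c];
    for (int i = begin; i < end; ++i)
        if (in[i] != 0) out[pos++] = in[i];
}

// Hypothetical host wrapper that runs the three passes and returns the compacted array.
std::vector<int> compact(const std::vector<int>& h_in) {
    int n = static_cast<int>(h_in.size());
    if (n == 0) return {};
    int numChunks = (n + CHUNK - 1) / CHUNK;
    int threads = 128, blocks = (numChunks + threads - 1) / threads;

    int *d_in, *d_out, *d_counts, *d_offsets;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_counts, numChunks * sizeof(int));
    cudaMalloc(&d_offsets, numChunks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    count_nonzero<<<blocks, threads>>>(d_in, n, d_counts, numChunks);

    // Pass 2: exclusive prefix sum of the counts (on the host here, to keep the sketch short).
    std::vector<int> counts(numChunks), offsets(numChunks);
    cudaMemcpy(counts.data(), d_counts, numChunks * sizeof(int), cudaMemcpyDeviceToHost);
    int total = 0;
    for (int c = 0; c < numChunks; ++c) { offsets[c] = total; total += counts[c]; }
    cudaMemcpy(d_offsets, offsets.data(), numChunks * sizeof(int), cudaMemcpyHostToDevice);

    scatter_nonzero<<<blocks, threads>>>(d_in, n, d_offsets, d_out, numChunks);

    std::vector<int> h_out(total);
    cudaMemcpy(h_out.data(), d_out, total * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_counts); cudaFree(d_offsets);
    return h_out;
}

int main() {
    std::vector<int> a = {1,0,0,0,0,3,0,0,5,2,2,3,1,0,0,0,0,0,2};
    for (int v : compact(a)) printf("%d ", v);  // prints: 1 3 5 2 2 3 1 2
    printf("\n");
    return 0;
}
```

For the example array with `CHUNK = 4`, the per-thread counts are 1, 1, 4, 1, 1 and the offsets are 0, 1, 2, 6, 7. Note that letting each thread read a contiguous chunk is easy to follow but gives poorly coalesced memory access, and the host round trip for the prefix sum costs time, so treat this as an explanation of the idea rather than a fast implementation.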

If you need a ready-to-use function, Thrust can do it. It is a library that ships with CUDA and closely mimics STL features.

Google "thrust stream compaction".

If you're not familiar with Thrust, google and read the Thrust quick start guide first.
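As a rough idea of what stream compaction looks like with Thrust, here is a small sketch using `thrust::copy_if`. The predicate `is_nonzero` and the wrapper `compact_with_thrust` are names invented for this example; the predicate keeps every element that is not zero, which matches the array in the question.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>

// Hypothetical predicate: true for the elements we want to keep.
struct is_nonzero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Hypothetical wrapper: copies the input to the GPU, compacts it, copies the result back.
std::vector<int> compact_with_thrust(const std::vector<int>& h_in) {
    thrust::device_vector<int> in(h_in.begin(), h_in.end());
    thrust::device_vector<int> out(in.size());  // worst case: nothing is removed

    // copy_if returns an iterator one past the last element written.
    auto out_end = thrust::copy_if(in.begin(), in.end(), out.begin(), is_nonzero());

    std::vector<int> result(out_end - out.begin());
    thrust::copy(out.begin(), out_end, result.begin());
    return result;
}

int main() {
    std::vector<int> a = {1,0,0,0,0,3,0,0,5,2,2,3,1,0,0,0,0,0,2};
    for (int v : compact_with_thrust(a)) printf("%d ", v);  // prints: 1 3 5 2 2 3 1 2
    printf("\n");
    return 0;
}
```

The order of the kept elements is preserved, so the result is the input with the zeros removed. The host-to-device and device-to-host copies are only there to make the example self-contained; if the data already lives on the GPU they are not needed.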

Hello!
Thank you for your answers.

I tried the Thrust library, and it is indeed what I was looking for, but it is too slow.
I used:

thrust::copy_if(thrust::device, in, in + width*height, out, is_even());

For size = 512x384 the time is 3.3 ms, which is far too long for the application I want it for.
And for size = 1024x1024 the time is 5.4 ms…

Is there something faster, or an algorithm (no library functions, just CUDA C) that does the same thing more quickly? A method based on reduction, or something else? I cannot think of anything good!

BulatZiganshin, I cannot understand your idea of the threads counting the zeros. Could you please explain it a little more, or show me a piece of code or pseudocode so I can understand it?

Thank you!!

[1] What is the desired or required execution time for your application or use case?

[2] What hardware are you using? Discussions of software performance that do not include the specification of the hardware used are meaningless. In the case of GPUs, the performance of an identical piece of software can easily span a decimal order of magnitude between the slowest and the fastest GPUs in common use at any given time.

txbob already gave you the name of the general algorithm you are looking for: stream compaction. This is a form of reduction by its very nature.

BulatZiganshin sketched an outline of how you could implement this yourself. With the help of Google Scholar you should be able to find plenty of relevant literature, and you can probably find worked examples on GitHub and similar code repositories.