Yes. In short, each thread counts the number of zeroes in one sub-array, then the algorithm computes the initial output position for each sub-array (a prefix sum over those counts), and a final pass copies the data.
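To make the three passes concrete, here is a minimal sequential sketch in plain C++, not a tuned CUDA implementation. Each iteration of the loop over `c` stands in for one thread (or block) working on its own sub-array. The names `compact_chunks`, `chunk` and `keep` are made up for illustration, and `keep` is whatever predicate selects the elements you want in the output.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sequential reference for three-pass stream compaction.
// Assumption: chunk > 0. On a GPU, each iteration over `c` in passes 1 and 3
// would be done by a separate thread or block.
template <typename T, typename Pred>
std::vector<T> compact_chunks(const std::vector<T>& in, std::size_t chunk, Pred keep)
{
    const std::size_t n = in.size();
    const std::size_t num_chunks = (n + chunk - 1) / chunk;

    // Pass 1: count the elements each sub-array will contribute to the output.
    std::vector<std::size_t> count(num_chunks, 0);
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t end = std::min(n, (c + 1) * chunk);
        for (std::size_t i = c * chunk; i < end; ++i)
            if (keep(in[i])) ++count[c];
    }

    // Pass 2: an exclusive prefix sum over the counts gives the first
    // output index of every sub-array.
    std::vector<std::size_t> offset(num_chunks, 0);
    std::size_t total = 0;
    for (std::size_t c = 0; c < num_chunks; ++c) {
        offset[c] = total;
        total += count[c];
    }

    // Pass 3: every sub-array copies its selected elements, starting at
    // its own offset, so no two sub-arrays write to the same location.
    std::vector<T> out(total);
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t end = std::min(n, (c + 1) * chunk);
        std::size_t pos = offset[c];
        for (std::size_t i = c * chunk; i < end; ++i)
            if (keep(in[i])) out[pos++] = in[i];
    }
    return out;
}
```

Because the offsets are known before the copy starts, the sub-arrays are independent in passes 1 and 3, which is what makes the scheme parallel; only the prefix sum in pass 2 needs cooperation between them. The order of the input is preserved in the output.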
If you need a ready-to-use function, Thrust can do it. It ships with the CUDA Toolkit and closely mimics the STL.
I tried the Thrust library, and it is indeed what I was looking for, but it is too slow.
I used:
thrust::copy_if(thrust::device, in, in + width*height, out, is_even());
For size = 512x384 the time is 3.3 ms, which is huge for the application I want it for.
And for size = 1024x1024 the time is 5.4 ms.
Is there something faster, or an algorithm (no library functions, just CUDA C) that does the same thing more quickly? A method based on reduction, or something else? I cannot think of anything good!
BulatZiganshin, I cannot understand your idea with the threads counting the zeroes. Could you please explain it a little more, or show me a piece of code or pseudocode so I can understand it?
[1] What is the desired or required execution time for your application or use case?
[2] What hardware are you using? Discussions of software performance that do not include the specification of the hardware used are meaningless. In the case of GPUs, the performance of an identical piece of software can easily span a decimal order of magnitude between the slowest and the fastest GPUs in common use at any given time.
txbob already gave you the name of the general algorithm you are looking for: stream compaction. This is a form of reduction by its very nature.
BulatZiganshin sketched an outline of how you could implement this yourself. With the help of Google Scholar you should be able to find much relevant literature, and you can probably find worked examples on GitHub and similar code repositories.