How to parallelize a serial code

josephjd1996 · March 15, 2018, 11:25am

Hello to all!
I am looking for parallel CUDA code that does this:

Given the array a[1,0,0,0,0,3,0,0,5,2,2,3,1,0,0,0,0,0,2]
the output will be the array
a[1,3,5,2,2,3,1,2] (it is the same array without the zeros)

I think this algorithm is serial, but I need a parallel algorithm for it. Is there any ready-made CUDA code for something like this?

Thank you very much in advance!

BulatZiganshin · March 15, 2018, 12:59pm

yes. in short: each thread counts the number of zeroes in one sub-array, then the algorithm computes the starting output position for each sub-array from those counts (a prefix sum), and a final pass copies the data
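The three passes described above can be sketched roughly as follows. This is only an illustration under several assumptions, not code from this thread: the names `countNonZero`, `copyNonZero`, `compactNonZero` and the `chunk` parameter are made up, the kernels count non-zero elements rather than zeroes (equivalent, since the chunk size is known), and the prefix sum of pass 2 runs on the host to keep the sketch short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Pass 1: thread c counts the non-zero elements in chunk c.
__global__ void countNonZero(const int* in, int n, int chunk,
                             int* counts, int numChunks)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChunks) return;
    int begin = c * chunk;
    int end   = min(begin + chunk, n);
    int cnt = 0;
    for (int i = begin; i < end; ++i)
        cnt += (in[i] != 0);
    counts[c] = cnt;
}

// Pass 3: thread c copies the non-zero elements of chunk c,
// starting at the output position computed for that chunk.
__global__ void copyNonZero(const int* in, int n, int chunk,
                            const int* offsets, int numChunks, int* out)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChunks) return;
    int begin = c * chunk;
    int end   = min(begin + chunk, n);
    int pos   = offsets[c];
    for (int i = begin; i < end; ++i)
        if (in[i] != 0) out[pos++] = in[i];
}

// Hypothetical host wrapper; error checking is omitted for brevity.
std::vector<int> compactNonZero(const std::vector<int>& h_in, int chunk)
{
    int n = static_cast<int>(h_in.size());
    if (n == 0) return std::vector<int>();
    int numChunks = (n + chunk - 1) / chunk;
    int threads = 128;
    int blocks  = (numChunks + threads - 1) / threads;

    int *d_in, *d_out, *d_counts, *d_offsets;
    cudaMalloc(&d_in,      n * sizeof(int));
    cudaMalloc(&d_out,     n * sizeof(int));
    cudaMalloc(&d_counts,  numChunks * sizeof(int));
    cudaMalloc(&d_offsets, numChunks * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    countNonZero<<<blocks, threads>>>(d_in, n, chunk, d_counts, numChunks);

    // Pass 2: exclusive prefix sum of the counts gives each chunk's
    // output offset. Done on the host here only to keep the sketch
    // short (an assumption of this example, not a recommendation).
    std::vector<int> h_counts(numChunks), h_offsets(numChunks);
    cudaMemcpy(h_counts.data(), d_counts, numChunks * sizeof(int),
               cudaMemcpyDeviceToHost);
    int total = 0;
    for (int c = 0; c < numChunks; ++c) {
        h_offsets[c] = total;
        total += h_counts[c];
    }
    cudaMemcpy(d_offsets, h_offsets.data(), numChunks * sizeof(int),
               cudaMemcpyHostToDevice);

    copyNonZero<<<blocks, threads>>>(d_in, n, chunk, d_offsets, numChunks, d_out);

    std::vector<int> h_out(total);
    if (total > 0)
        cudaMemcpy(h_out.data(), d_out, total * sizeof(int),
                   cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(d_out); cudaFree(d_counts); cudaFree(d_offsets);
    return h_out;
}

int main()
{
    // The array from the first post; the result should be 1 3 5 2 2 3 1 2.
    std::vector<int> a = {1,0,0,0,0,3,0,0,5,2,2,3,1,0,0,0,0,0,2};
    std::vector<int> r = compactNonZero(a, 4);
    for (int v : r) printf("%d ", v);
    printf("\n");
    return 0;
}
```

A tuned version would likely keep the prefix sum on the device and arrange the reads so that neighbouring threads touch neighbouring elements; the per-thread chunk loops above are chosen for readability.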

if you need a ready-to-use function, Thrust can do it. It is a library shipped with the CUDA toolkit that closely mimics the STL
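For the array in the first post, a minimal Thrust sketch could look like this. The predicate `is_nonzero` and the helper `removeZeros` are names invented for this example; `thrust::copy_if`, `thrust::copy` and `thrust::device_vector` are the library calls used.

```cuda
#include <cstdio>
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Hypothetical predicate: keep every element that is not zero.
struct is_nonzero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Hypothetical helper: copies the input to the GPU, compacts it there,
// and returns the non-zero elements in their original order.
std::vector<int> removeZeros(const std::vector<int>& h_in)
{
    thrust::device_vector<int> d_in(h_in.begin(), h_in.end());
    thrust::device_vector<int> d_out(d_in.size());

    // copy_if returns an iterator one past the last element written.
    thrust::device_vector<int>::iterator last =
        thrust::copy_if(d_in.begin(), d_in.end(), d_out.begin(), is_nonzero());

    std::vector<int> h_out(last - d_out.begin());
    thrust::copy(d_out.begin(), last, h_out.begin());
    return h_out;
}

int main()
{
    std::vector<int> a = {1,0,0,0,0,3,0,0,5,2,2,3,1,0,0,0,0,0,2};
    std::vector<int> r = removeZeros(a);
    for (int v : r) printf("%d ", v);   // should print 1 3 5 2 2 3 1 2
    printf("\n");
    return 0;
}
```

When the value to drop is a fixed constant such as 0, `thrust::remove_copy` from `<thrust/remove.h>` is an alternative that needs no predicate.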

Robert_Crovella · March 15, 2018, 1:12pm

google thrust stream compaction

if you’re not familiar with thrust, google and read the thrust quick start guide first

josephjd1996 · March 16, 2018, 3:03pm

Hello!
Thank you for your answers.

I tried the Thrust library, and it is indeed what I was looking for, but it is too slow.
I used:

thrust::copy_if(thrust::device, in, in + width*height, out, is_even());

For size = 512x384 the time is 3.3 ms, which is huge for the application I need it for.
For size = 1024x1024 the time is 5.4 ms…
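The post does not say how these times were measured. One thing that could be checked, purely as an assumption about the setup, is whether the figure includes one-time costs of the first call (CUDA context creation, Thrust's temporary allocations). A sketch of timing `thrust::copy_if` with CUDA events after a warm-up call follows; `is_nonzero`, `averageCopyIfMs` and the test data are made up for this example.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <cuda_runtime.h>

// Hypothetical predicate: keep every element that is not zero.
struct is_nonzero {
    __host__ __device__ bool operator()(int x) const { return x != 0; }
};

// Hypothetical helper: average time of one copy_if over n elements, in ms.
float averageCopyIfMs(int n, int reps)
{
    // Made-up test data: every third element is non-zero.
    thrust::host_vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = (i % 3 == 0) ? 1 : 0;
    thrust::device_vector<int> d_in = h;
    thrust::device_vector<int> d_out(n);

    // Warm-up call so that one-time startup costs are not measured.
    thrust::copy_if(d_in.begin(), d_in.end(), d_out.begin(), is_nonzero());
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        thrust::copy_if(d_in.begin(), d_in.end(), d_out.begin(), is_nonzero());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

int main()
{
    printf("512x384:   %.3f ms\n", averageCopyIfMs(512 * 384, 20));
    printf("1024x1024: %.3f ms\n", averageCopyIfMs(1024 * 1024, 20));
    return 0;
}
```

The numbers this prints depend entirely on the GPU, so they are only comparable to the figures above when run on the same hardware.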

Is there something faster, or an algorithm (no library functions, just CUDA C) that does the same thing more quickly? A method based on reduction, or something else? I cannot think of anything good!

BulatZiganshin, I do not understand your idea of the threads counting the zeros. Could you please explain it a little more, or show me a piece of code or pseudocode so I can understand it?

Thank you!!

njuffa · March 16, 2018, 4:16pm

[1] What is the desired or required execution time for your application or use case?

[2] What hardware are you using? Discussions of software performance that do not include the specification of the hardware used are meaningless. In the case of GPUs, the performance of an identical piece of software can easily span a decimal order of magnitude between the slowest and the fastest GPUs in common use at any given time.

txbob (Robert_Crovella above) already gave you the name of the general algorithm you are looking for: stream compaction. This is a form of reduction by its very nature.

BulatZiganshin sketched an outline of how you could implement this yourself. With the help of Google Scholar you should be able to find much relevant literature, and you can probably find worked examples on GitHub and similar code repositories.
