Parallel Patterns Request for participation!


“Parallel Patterns” is all about the patterns involved in parallel decomposition of a problem.

A UIUC page:

This topic is an effort to collect such patterns across different verticals (HOOMD, finance, oil and gas, etc.).
It is impossible for one person to know all the verticals,
so a forum like this would be an ideal place to collect and document such patterns.

To get on the same page before documenting, it would be good if all of us go through the PPT on the UIUC website:

Then, we can start identifying the algorithms each of us works on and either:
Associate them with an existing pattern, or
Create a new pattern and document it.

At the end of it (or somewhere midway), we can create a PPT out of it for everyone's benefit.

One has to note that parallelization happens in 2 stages:

  1. For multi-GPU (same as how one would parallelize for a distributed memory system) - Mostly a Divide and Conquer Pattern
  2. Inside a GPU (Same as how one would parallelize for a shared memory system)

Also, we need to look for any specific CUDA patterns that people use. (I understand that people may not be willing to discuss the particular techniques they use to boost performance, but if we can at least get some at a broad level, that should be OK.)

I will document what I know of, in a subsequent reply to this topic.

Thanks for all your time!

Best Regards,


Thanks for the link to this excellent resource! I agree that a patterned approach in learning to partition parallel apps is needed.



Thanks for your kind words.

As the topic starter, I have a responsibility to add something from my side before I can ask others. So, here we go.

Here is a description of the binomial-tree algorithm that is used to price stock options in finance. I won't get into the finance details, but I will give you the details of the computation.


Consider an N-step recombining binomial tree:

The nodes at the "N"th time step are simply the POSSIBLE option prices at the "N"th time step. "N" is a configurable number; the greater the value of "N", the more accurate your option price (a big "N" means a larger sample space of probable option prices).

The algorithm begins by computing the nodes at the "N"th time step and then back-calculates the node values at each previous time step, and so on until node '0' – which is the option price today.

Generating the "N"th-step node data is an embarrassingly parallel situation: all N+1 nodes can be calculated independently.

However, the nodes at step "n-1" depend on the nodes at step "n", so there is a sequential dependency: you cannot generate the nodes of more than one time step at any given point. You have to come down step by step.

More specifically,

The "i"th node at time step “n-1” depends on "i"th node of step “n” and "i+1"th node of step “n”.


I am not going to give the most optimal CUDA version, because that is proprietary to our company. Instead, I will talk about how NVIDIA chose to parallelize this problem (it is there in the sample):

  1. One Block Per Option

  2. 256 threads per block (not sure if it is exactly 256; just that there are many threads per block).

  3. Application prices 1000s of options bunched together.

Step 0:

The block starts at the "N"th time step and first generates all of its nodes (embarrassingly parallel).

Step 1:

Iteration 1

The block then "selects" a chunk of "K" nodes out of the "n+1" nodes and reduces it until it reaches time step "J".

The resulting figure looks like a trapezium. The nodes processed in this iteration are marked "X" below; the untouched nodes of time step "n" are shown as "0":

X X X X X 0 0 0 0

X X X X

X X X

In the figure above, "K = 5", "n = 8", "J = 6".

Iteration 2

The block then starts with time step "n" again and selects the next "K" nodes. However, this set of "K" nodes starts from index "3" (because we have already generated 3 nodes at time step J). The nodes processed in this iteration are marked "X" in the figure below; nodes already computed in iteration 1 are shown as "0":

0 0 0 X X X X X 0

0 0 0 X X X X

0 0 0 X X X

Now we have generated 6 nodes for time step J (which is 6). One more node remains.

Another iteration will be needed to generate nodes of time step “J” completely.

As you can see, there is redundancy in both computation and memory operations (the chunks overlap at their edges), so it is not an optimal solution.
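The chunked scheme above can be mimicked on the CPU to see where the redundancy comes from. This is my own reconstruction of the description, not NVIDIA's actual kernel; `combine` stands in for the per-node pricing update:

```python
def reduce_step(nodes, combine):
    """One time step: node i depends on nodes i and i+1 of the step above."""
    return [combine(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]

def tiled_reduce(top, K, levels, combine):
    """Reduce `levels` time steps in chunks of K top-row nodes (trapezium
    scheme): each iteration reads K top nodes and emits K - levels bottom
    nodes, so adjacent chunks overlap by `levels` nodes (redundant work)."""
    assert K > levels, "a chunk must be wide enough to survive the reduction"
    out_width = K - levels  # nodes produced at the bottom per iteration
    result, start = [], 0
    while len(result) < len(top) - levels:
        chunk = top[start:start + K]
        for _ in range(levels):
            chunk = reduce_step(chunk, combine)
        result.extend(chunk)
        start += out_width
    return result[:len(top) - levels]
```

With "n = 8" (9 top nodes), "K = 5" and two levels of reduction, this produces the 7 nodes of time step "J = 6" in three iterations, re-reading the overlapping top nodes each time, exactly the trapezium pattern sketched above.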

PARALLEL SOLUTION For Distributed Memory Systems

Check this URL out:


More than the design pattern, I think we should worry about the problem pattern.

This problem is like a “Descending Staircase” OR more generally “Step-wise Computation”.

Step-wise computation pattern

  1. First, apply a set of operations on a data set to generate another set of data.

  2. Then, use that data to generate yet another set, and so on until completion.

The wavefront model of computation in molecular dynamics (vague memory here; correct me if I am wrong) also involves such a calculation.

Design pattern

I think the problem pattern is good enough. We need to collect inputs from the community to analyze the solution for patterns.

Request to readers

Have you encountered a similar problem? If so, kindly spare some time and document both the problem and the solution.

Appreciate your time! It will help us all learn! THANKS!

Best Regards,