This topic is an effort to collect such patterns across different verticals (right from HOOMD, finance, oil n gas, etc…).
It is impossible for one person to know all verticals.
So, a forum like this would be an ideal place to collect and document such patterns.

Then, we can start identifying the algorithms that each of us work and :
Associate then with an existing pattern
OR
Create a new Pattern and document it.

At the end of it (or somewhere in mid), we can create a PPT out of it for every1’s benefit.

One has to note that parallelization happens in 2 stages:

For multi-GPU (same as how one would parallelize for a distributed memory system) - Mostly a Divide and Conquer Pattern

Inside a GPU (Same as how one would parallelize for a shared memory system)

Also, we need to look for any specific CUDA pattern that people use (I can understand that people may not be willing to discuss particular techniques they use to boost performance. But atleast if we can get some @ a broad level, it should be ok)

I will document what I know of, in a subsequent reply to this topic.

As the topic starter, I have a responsbility of adding something from my side before I can even ask others. So, Here we go.

Here is a description of the binomial-tree algorithm that is used to price stock options in finance. I wont get into details of finance but I am going to give you the details of the computation…

PROBLEM STATEMENT:

Consider this n-step re-combining binomial tree:

The nodes at "n"th time step are nothing but POSSIBLE option prices at "N"th time-step. “N” is a configurable number. The greater the value of “N” - the more accurate is your option price (BIG N means -> More sample space of probable option prices)

The algorithm begins by finding the nodes in the "n"th time step and then back calculating the node values at previous time steps so on until node ‘0’ – which is the option price today.

Generation of "n"th node data is an embarassingly-parallel situation. All n+1 nodes can be calculated independently.

However, the nodes at step “n-1” are dependent on nodes at step “n”. Thus, there is a sequential dependency. i.e You cannot generate nodes of more than 1 time-step at any given point of time. You have to come down step by step.

More specifically,

The "i"th node at time step “n-1” depends on "i"th node of step “n” and "i+1"th node of step “n”.

PARALLEL SOLUTION for CUDA

I am not going to give the most optimal one on CUDA because thats going to be proprietary to our company. I will talk about how NVIDIA chose to parallelize this problem (Its there in the sample)

One Block Per Option

256 threads per block (not sure if 256 exactly… Just that more number of threads).

Application prices 1000s of options bunched together.

Step 0:

The block starts with "N"th node and first generates all nodes (embarassingly parallel)

Step 1:

Iteration 1

The block then “selects” a chunk of “K” nodes out of the “n+1” nodes and reduces it until it reaches timestep “J”.

The resulting figure looks like a trapezium. See the highlighted portion below.

0 0 0 0 0 0 0 0 0

0 0 0 0

0 0 0

In the figure above “K = 5”, “n = 8”, “J = 6”.

Iteration 2

The block then starts with time step “n” again and selects the next “K” nodes. However this set of “K” nodes starts from index “3”. (because we have already generated 3 nodes at time step J). The processing done in this iteration is shown in bold in the figure below:

0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0

0 0 0 0 0 0

Now we have generated 6 nodes for time step J (which is 6). One more node is remaining.

Another iteration will be needed to generate nodes of time step “J” completely.

As you can see there is redundancy of computation and memory operations. So, it is not an optimal solution.