As I mentioned before, each parallel thread performs a partial reduction over the loop iterations that it executes. Then, after the loop, the compiler inserts another parallel region that combines each thread’s partial result into the final value. Reductions work for operations that are associative and commutative, such as the multiplication used below, because the result then does not depend on how the iterations are grouped or in what order they are applied.

For example, let’s use your example of 4 loop iterations with two parallel threads. Each thread starts from 1, the identity for multiplication. Thread 1 will compute a partial reduction for iterations 1 and 2, and Thread 2 will compute one for iterations 3 and 4.

Thread 1:

I1: 1 * 1 => 1

I2: 1 * 2 => 2

Thread 2:

I3: 1 * 3 => 3

I4: 3 * 4 => 12

T1’s partial reduction = 2

T2’s partial reduction = 12

Next the final reduction will be: 2 * 12 => 24

Let’s reorder this so Thread 1 now computes iterations 1 and 3, and Thread 2 computes iterations 2 and 4.

Thread 1:

I1: 1 * 1 => 1

I3: 1 * 3 => 3

Thread 2:

I2: 1 * 2 => 2

I4: 2 * 4 => 8

T1’s partial reduction = 3

T2’s partial reduction = 8

Next the final reduction will be: 3 * 8 => 24
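If it helps to see both cases side by side, here is a minimal Python sketch that simulates them. It is only an illustration of the idea, not what the compiler actually generates: the function name `parallel_product` and the `schedule` argument are made up for this example, and the "threads" are simulated one after another rather than run concurrently.

```python
from functools import reduce
from operator import mul


def parallel_product(values, schedule):
    """Simulate a product reduction split across threads.

    `schedule` (a hypothetical name) holds one list per thread, giving the
    1-based iteration numbers that thread executes.
    """
    # Each thread starts from 1, the identity for multiplication, and
    # folds in only the values of its own iterations.
    partials = [
        reduce(mul, (values[i - 1] for i in iters), 1)
        for iters in schedule
    ]
    # Final reduction: combine the per-thread partial results.
    return partials, reduce(mul, partials, 1)


values = [1, 2, 3, 4]  # iteration i contributes the value i

# Thread 1 gets iterations 1 and 2, Thread 2 gets iterations 3 and 4.
block_partials, block_total = parallel_product(values, [[1, 2], [3, 4]])

# Thread 1 gets iterations 1 and 3, Thread 2 gets iterations 2 and 4.
cyclic_partials, cyclic_total = parallel_product(values, [[1, 3], [2, 4]])

print(block_partials, block_total)    # [2, 12] 24
print(cyclic_partials, cyclic_total)  # [3, 8] 24
```

The partial results differ between the two distributions, but the final reduction gives 24 either way.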

Does this help you understand the basics of how parallel reductions work?

I did find this Wikipedia page which gives more of the theoretical background if you want more details: Reduction Operator - Wikipedia