L Infinity Norm

I am thinking of two ways to parallelize the L-Inf norm of a matrix:

  1. Loop over every column. For each column, compute the sum using a parallel reduction.

  2. Assign one thread to each column. The summation for a single column is performed serially by that thread.

Which of these two approaches will give better performance?

After the sum of each column is computed, the sums are stored in a 1-D array. I am thinking of modifying the parallel reduction example to find the maximum element in that array, but would that lead to divergence? (Since each thread would have an if condition.)

Option 2, if I understand the two options correctly. Just make sure that your memory accesses are coalesced.
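
A minimal sketch of option 2, under some assumptions that are not in the original post: the matrix is stored row-major in a flat float array, and the names columnAbsSums, rows and cols are made up for illustration. With that layout, at every step of the serial loop neighbouring threads read neighbouring addresses, which is what makes the loads coalesce. (The infinity norm is usually defined through row sums of absolute values; if the matrix is stored column-major, the same kernel applied to the raw buffer yields exactly those row sums.)

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per column. A is assumed row-major (rows x cols), so for a fixed
// row r, threads col, col+1, ... read consecutive floats (coalesced access).
__global__ void columnAbsSums(const float* A, float* sums, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= cols) return;              // guard for the last, partial block

    float s = 0.0f;
    for (int r = 0; r < rows; ++r)        // serial sum down one column
        s += fabsf(A[(size_t)r * cols + col]);
    sums[col] = s;
}

int main()
{
    const int rows = 4, cols = 1000;      // illustrative sizes only
    std::vector<float> hA((size_t)rows * cols), hSums(cols);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            hA[(size_t)r * cols + c] = (c % 2 ? -1.0f : 1.0f) * (r + 1);

    float *dA, *dSums;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dSums, cols * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (cols + threads - 1) / threads;
    columnAbsSums<<<blocks, threads>>>(dA, dSums, rows, cols);
    cudaMemcpy(hSums.data(), dSums, cols * sizeof(float), cudaMemcpyDeviceToHost);

    // Every column holds +-1, +-2, +-3, +-4, so each absolute sum is 10.
    bool ok = true;
    for (int c = 0; c < cols; ++c) ok = ok && (hSums[c] == 10.0f);
    printf("%s\n", ok ? "all column sums match" : "mismatch");

    cudaFree(dA);
    cudaFree(dSums);
    return ok ? 0 : 1;
}
```

If the matrix has few columns but many rows, this leaves most of the GPU idle, and a per-column reduction (option 1) may become competitive; that trade-off is worth measuring rather than assuming.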

max() maps directly to a single machine instruction, so it causes no divergence. Even without the max instruction, the comparison can be implemented very efficiently using predicated instructions (which is what the compiler does whenever it does not translate the code into the max instruction automatically), so divergence is not an issue here.
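
As a rough illustration, the usual shared-memory reduction can be turned into a max reduction by replacing the addition with fmaxf(). This is only a sketch: the kernel name maxReduce, the block size kThreads and the two-pass launch are assumptions, not something from the original example. The only branch left is the usual threadIdx.x < stride test that the sum reduction has as well; the max itself adds none.

```cuda
#include <cstdio>
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;   // block size; must be a power of two

// Each block writes the maximum of its share of in[0..n) to out[blockIdx.x].
// Must be launched with blockDim.x == kThreads.
__global__ void maxReduce(const float* in, float* out, int n)
{
    __shared__ float cache[kThreads];

    // Grid-stride loop: each thread folds several elements with fmaxf().
    float m = -FLT_MAX;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        m = fmaxf(m, in[i]);
    cache[threadIdx.x] = m;
    __syncthreads();

    // Tree reduction in shared memory, same shape as the sum reduction.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] = fmaxf(cache[threadIdx.x],
                                       cache[threadIdx.x + stride]);
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = cache[0];
}

int main()
{
    const int n = 1000, blocks = 4;       // illustrative sizes only
    std::vector<float> h(n);
    for (int i = 0; i < n; ++i) h[i] = (float)(i % 97);
    h[123] = 500.0f;                      // planted maximum

    float *dIn, *dPartial, *dResult;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dResult, sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    maxReduce<<<blocks, kThreads>>>(dIn, dPartial, n);        // per-block maxima
    maxReduce<<<1, kThreads>>>(dPartial, dResult, blocks);    // max of the maxima

    float result = 0.0f;
    cudaMemcpy(&result, dResult, sizeof(float), cudaMemcpyDeviceToHost);
    printf("max = %f\n", result);

    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dResult);
    return result == 500.0f ? 0 : 1;
}
```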