I am thinking of two ways to parallelize the L∞ (infinity) norm of a matrix:

1. Loop over the columns, and compute each column's sum with a parallel reduction.

2. Assign one thread to each column, so that each column is summed serially by its own thread.

Which of these two approaches will give better performance?
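
To make the comparison concrete, here is a minimal sketch of both approaches. It rests on assumptions that are not stated above: the matrix is row-major `float` data, the sums are taken over absolute values (as a norm requires), and approach 1 uses one thread block per column instead of a host-side loop over columns. The names `colAbsSumSerial`, `colAbsSumReduce` and `columnAbsSums` are hypothetical. Note that the infinity norm is conventionally the maximum absolute row sum; the sketch keeps the column-wise summation described above.

```cuda
#include <cassert>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Assumption: A is row-major, so element (r, c) lives at A[r * cols + c].

// Approach 2: one thread per column, each thread sums its column serially.
// Neighbouring threads read neighbouring addresses in every iteration.
__global__ void colAbsSumSerial(const float* A, float* colSum, int rows, int cols) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= cols) return;
    float s = 0.0f;
    for (int r = 0; r < rows; ++r)
        s += fabsf(A[r * cols + c]);
    colSum[c] = s;
}

// Approach 1 (variant): one block per column, reduced in shared memory.
// Requires blockDim.x to be a power of two and blockDim.x * sizeof(float)
// bytes of dynamic shared memory. Reads within a block are strided by cols.
__global__ void colAbsSumReduce(const float* A, float* colSum, int rows, int cols) {
    extern __shared__ float partial[];
    int c = blockIdx.x;
    float s = 0.0f;
    for (int r = threadIdx.x; r < rows; r += blockDim.x)
        s += fabsf(A[r * cols + c]);
    partial[threadIdx.x] = s;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) colSum[c] = partial[0];
}

// Hypothetical host helper; CUDA error checking is omitted for brevity.
std::vector<float> columnAbsSums(const std::vector<float>& A, int rows, int cols,
                                 bool useReduction) {
    assert(A.size() == static_cast<size_t>(rows) * cols);
    float *dA = nullptr, *dSum = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dSum, cols * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    const int threads = 256;  // power of two, needed by the tree reduction
    if (useReduction)
        colAbsSumReduce<<<cols, threads, threads * sizeof(float)>>>(dA, dSum, rows, cols);
    else
        colAbsSumSerial<<<(cols + threads - 1) / threads, threads>>>(dA, dSum, rows, cols);
    std::vector<float> out(cols);
    cudaMemcpy(out.data(), dSum, cols * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dSum);
    return out;
}
```

Under the row-major assumption, the one-thread-per-column kernel gets coalesced loads, while the per-column reduction reads with a stride of `cols`. With column-major storage the situation is reversed, so the answer likely depends on the layout and on the matrix shape, and both are worth timing.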
Once the column sums are computed, they are stored in a 1D array. I am thinking of modifying the parallel reduction example to find the maximum element of that array, but would that lead to divergence, since each thread would have an if condition?
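
For the maximum step, a sketch of how the sum reduction might be adapted is shown below. It replaces the addition with `fmaxf`, which needs no explicit `if` for the comparison. The names `maxReduce` and `arrayMax` are hypothetical, and the sketch assumes a single block is enough; a larger array would need per-block results and a second pass.

```cuda
#include <cassert>
#include <cfloat>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Max reduction within one block. Requires blockDim.x to be a power of two
// and blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void maxReduce(const float* in, float* out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    float m = -FLT_MAX;  // identity element for max over finite floats
    // Each thread first folds a strided slice of the input into a register.
    for (int i = tid; i < n; i += blockDim.x)
        m = fmaxf(m, in[i]);
    sdata[tid] = m;
    __syncthreads();
    // Same tree shape as the sum reduction, with fmaxf in place of +.
    // The only branch is the tid < stride guard that the sum version has too.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + stride]);
        __syncthreads();
    }
    if (tid == 0) *out = sdata[0];
}

// Hypothetical host helper; CUDA error checking is omitted for brevity.
float arrayMax(const std::vector<float>& v) {
    assert(!v.empty());
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, v.size() * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, v.data(), v.size() * sizeof(float), cudaMemcpyHostToDevice);
    const int threads = 256;  // single block, assumed sufficient here
    maxReduce<<<1, threads, threads * sizeof(float)>>>(dIn, dOut, static_cast<int>(v.size()));
    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Written this way, the comparison itself does not add a data-dependent branch. The `tid < stride` guard can only split a warp once `stride` drops below the warp size, which is already the case in the sum reduction, so the maximum version should not diverge any more than the original example does.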