Find maximum values for each column of 2D array

gongjing · May 5, 2022, 10:54am

Hi,

We have code that finds the maximum value of each column of a 2D array and then divides each element by that maximum, like this:

__global__ void update_A_kernel(float* A, int NROW, int NCOL)
{
...
   int i = threadIdx.x + blockDim.x * blockIdx.x;
   if (i < NROW) {
      // Scan A[i*NCOL + 0 .. NCOL-1] for the maximum
      float rmax = A[i*NCOL];
      for (int j = 0; j < NCOL; j++) {
         rmax = fmaxf(rmax, A[i*NCOL+j]);
      }
      __syncthreads();
      // Divide the same elements by that maximum
      for (int j = 0; j < NCOL; j++) {
         A[i*NCOL+j] /= rmax;
      }
   }
...
}

update_A_kernel<<<NROW, 1>>>(...);

Right now we launch the CUDA kernel with NROW thread blocks of 1 thread each, which is not efficient. We tried launching the kernel with a 2D thread block, e.g. dim3(NROW, NCOL), to remove the second for-loop, but we have no idea how to handle the first loop with atomicMax.

Typical sizes of NROW and NCOL are O(100-2000). Can the performance be improved using a 2D thread block?

Thanks. /Jing

striker159 · May 5, 2022, 1:57pm

For the first loop, you are looking for an algorithm called parallel reduction. Use, for example, all 128 threads of a block to perform a parallel reduction that finds the maximum value, then use the same 128 threads to update A.
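
To make the suggestion concrete, here is a minimal sketch of one way it could look, assuming one thread block per row and a row-major A as in the posted kernel. The names normalize_rows_block, normalize_rows and THREADS are made up for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <vector>

constexpr int THREADS = 128;  // assumed block size; must be a power of two for the tree reduction

// One block per row: the block's threads cooperate on the max, then on the division.
__global__ void normalize_rows_block(float* A, int NROW, int NCOL)
{
    __shared__ float smax[THREADS];
    int row = blockIdx.x;
    if (row >= NROW) return;  // uniform for the whole block
    float* a = A + (size_t)row * NCOL;

    // Each thread scans a strided slice of the row; neighbouring threads read neighbouring elements.
    float m = -FLT_MAX;
    for (int j = threadIdx.x; j < NCOL; j += blockDim.x)
        m = fmaxf(m, a[j]);
    smax[threadIdx.x] = m;
    __syncthreads();

    // Tree reduction in shared memory: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smax[threadIdx.x] = fmaxf(smax[threadIdx.x], smax[threadIdx.x + s]);
        __syncthreads();
    }
    float rmax = smax[0];  // row maximum, now visible to every thread in the block

    // All threads update the row.
    for (int j = threadIdx.x; j < NCOL; j += blockDim.x)
        a[j] /= rmax;
}

// Hypothetical host wrapper: copy in, launch one block per row, copy back.
void normalize_rows(std::vector<float>& A, int NROW, int NCOL)
{
    float* d_A = nullptr;
    size_t bytes = A.size() * sizeof(float);
    cudaMalloc(&d_A, bytes);
    cudaMemcpy(d_A, A.data(), bytes, cudaMemcpyHostToDevice);
    normalize_rows_block<<<NROW, THREADS>>>(d_A, NROW, NCOL);
    cudaMemcpy(A.data(), d_A, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
}
```

With this layout no atomicMax is needed, because each row's maximum is combined inside a single block through shared memory. The kernel has to be launched with exactly THREADS threads per block, since the shared array is sized by that constant.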

Robert_Crovella · May 5, 2022, 3:23pm

Your posted code is finding the maximum value of each ROW, not each column. Yes, we could possibly have a difference of terminology, but the code itself seems to have a sense of row and column, and the thing that your first loop is iterating over to find the maximum is the columns, therefore I claim that the code itself has a sense of finding the maximum of each row:

  for (int j = 0; j < NCOL; j++)  {  // this is iterating over the columns, along a single row

If you can transpose your data, you can use a method like the one you have shown, with a few minor changes, and get good efficiency.
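
As a rough sketch of what the transposed variant might look like: assume the matrix is stored transposed, so that the element at logical row i, column j sits at A_T[j*NROW + i]. The kernel below keeps the one-thread-per-row loops of the original, but adjacent threads now touch adjacent addresses on every iteration. The names normalize_rows_transposed, normalize_transposed and A_T are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed layout: A_T holds the transpose, so logical element (i, j) is at A_T[j * NROW + i].
__global__ void normalize_rows_transposed(float* A_T, int NROW, int NCOL)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;  // one thread per logical row
    if (i < NROW) {
        // For a fixed j, threads i and i+1 read neighbouring floats (coalesced access).
        float rmax = A_T[i];
        for (int j = 1; j < NCOL; j++)
            rmax = fmaxf(rmax, A_T[(size_t)j * NROW + i]);
        for (int j = 0; j < NCOL; j++)
            A_T[(size_t)j * NROW + i] /= rmax;
    }
}

// Hypothetical host wrapper; A_T must already be in the transposed layout.
void normalize_transposed(std::vector<float>& A_T, int NROW, int NCOL)
{
    float* d_A = nullptr;
    size_t bytes = A_T.size() * sizeof(float);
    cudaMalloc(&d_A, bytes);
    cudaMemcpy(d_A, A_T.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 128;                            // assumed block size
    int blocks = (NROW + threads - 1) / threads;  // enough blocks to cover all rows
    normalize_rows_transposed<<<blocks, threads>>>(d_A, NROW, NCOL);
    cudaMemcpy(A_T.data(), d_A, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
}
```

The main changes from the posted kernel are the indexing, the launch configuration (many threads per block instead of one), and dropping __syncthreads(), since each thread only touches its own row.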

Otherwise you will need to learn about parallel reduction techniques, as already mentioned, to get good memory efficiency. There are many many many writeups on various web forums about parallel reduction methods.

Because you are doing one reduction on each row, you will want to use a segmented parallel reduction. Here is a recent thread discussing a very simple example of a segmented parallel reduction.
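
For illustration only, here is one possible shape of a segmented reduction, where each row is a segment handled by one warp and the partial maxima are combined with warp shuffles. This is a sketch under assumptions, not necessarily what the linked thread shows: the names normalize_rows_warp, normalize_rows_segmented, WARP and WARPS_PER_BLOCK are invented here, it relies on the __shfl_xor_sync intrinsic (CUDA 9 or later), and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cfloat>
#include <vector>

constexpr int WARP = 32;
constexpr int WARPS_PER_BLOCK = 4;  // assumed: 4 rows (segments) per 128-thread block

// One warp per row; each block processes WARPS_PER_BLOCK rows independently.
__global__ void normalize_rows_warp(float* A, int NROW, int NCOL)
{
    int lane = threadIdx.x % WARP;
    int row = blockIdx.x * WARPS_PER_BLOCK + threadIdx.x / WARP;
    if (row >= NROW) return;  // uniform per warp, so the shuffles below always see 32 lanes

    float* a = A + (size_t)row * NCOL;

    // Each lane scans a strided slice of its row.
    float m = -FLT_MAX;
    for (int j = lane; j < NCOL; j += WARP)
        m = fmaxf(m, a[j]);

    // Butterfly exchange: after 5 steps every lane holds the row maximum.
    for (int offset = WARP / 2; offset > 0; offset >>= 1)
        m = fmaxf(m, __shfl_xor_sync(0xffffffffu, m, offset));

    // Each lane rescales the same elements it scanned.
    for (int j = lane; j < NCOL; j += WARP)
        a[j] /= m;
}

// Hypothetical host wrapper for a row-major matrix.
void normalize_rows_segmented(std::vector<float>& A, int NROW, int NCOL)
{
    float* d_A = nullptr;
    size_t bytes = A.size() * sizeof(float);
    cudaMalloc(&d_A, bytes);
    cudaMemcpy(d_A, A.data(), bytes, cudaMemcpyHostToDevice);
    int blocks = (NROW + WARPS_PER_BLOCK - 1) / WARPS_PER_BLOCK;
    normalize_rows_warp<<<blocks, WARP * WARPS_PER_BLOCK>>>(d_A, NROW, NCOL);
    cudaMemcpy(A.data(), d_A, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A);
}
```

Whether a warp per row or a block per row is faster for NCOL in the 100-2000 range is something to measure; the sketch only shows that the rows can be reduced independently without shared memory or atomics.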

gongjing · May 5, 2022, 3:41pm

Hi Robert,

"Your posted code is finding the maximum value of each ROW, not each column."

Sorry, that was a typo. It should find the maximum value of each row.

"Because you are doing one reduction on each row, you will want to use a segmented parallel reduction. Here is a recent thread discussing a very simple example of a segmented parallel reduction."

I will look at it. Thanks for the information. /Jing
