Hi,

We have a code to find maximum values for each column of 2D array and then divided by the maximum values likes

```
__global__ update_A_kernel(float* A, int NROW, int NCOL)
{
...
int i = threadIdx.x + blockDim.x * blockIdx.x;
if (i < NROW) {
float rmax = A[i*NCOL];
for (int j = 0; j < NCOL; j++) {
rmax = fmaxf(rmax, A[i*NCOL+j]);
}
__syncthreads();
for (int j = 0; j < NCOL; j++) {
A[i*NCOL+j]) /= rmax;
}
}
...
update_A_kernel<<<NROW, 1>>>(...);
```

Right now we launch the cuda kernel with *NROW* thread block and 1 thread, which is not efficient. We try to launch the kernel with 2D thread block e.g. *dim3(NROW, NCOL)* to remove the second for-loop but not idea how to deal with the first loop with **atomicMax**.

Typical sizes of NROW and NCOl is O(100-2000). can the performance be improved using a 2D thread block?

Thanks. /Jing

For the first loop, you are looking for an algorithm called parallel reduction. Use for example all 128 threads per block to perform a parallel reduction to find the maximum value. Then use all 128 threads to update `A`

.

Your posted code is finding the maximum value of each ROW, not each column. Yes, we could possibly have a difference of terminology, but the code itself seems to have a sense of row and column, and the thing that your first loop is iterating over to find the maximum is *the columns*, therefore I claim that the code itself has a sense of finding the maximum of each row:

```
for (int j = 0; j < NCOL; j++) { // this is iterating over the columns, along a single row
```

If you can transpose your data, you can use a method like the one you have shown, with a few minor changes, and get good efficiency.

Otherwise you will need to learn about parallel reduction techniques, as already mentioned, to get good memory efficiency. There are many many many writeups on various web forums about parallel reduction methods.

Because you are doing one reduction on each row, you will want to use a segmented parallel reduction. Here is a recent thread discussing a very simple example of a segmented parallel reduction.

Hi Robert,

Your posted code is finding the maximum value of each ROW, not each column.

Sorry, it was typo. It should find the maximum value for each row.

Because you are doing one reduction on each row, you will want to use a segmented parallel reduction. Here is a recent thread discussing a very simple example of a segmented parallel reduction.

I will look at it. Thanks for the information. /Jing