How to do a reduction along the columns of a matrix

Hi, suppose I have a matrix and I want to do a reduction in the y direction, i.e. sum down each column. To simplify the question, the matrix size is 4 x N. My code seems to have a bug, but I am not sure what the problem is. Could you please give me some advice? Thanks!

#include <cuda.h>
#include <cstdint>

template<class T, int tile>
__global__ void sum_gpu_array(T **in, T *out, int64_t N, size_t in_size) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // in_size is 4
    T total[4];

    while (col < N) {
        for (int i = row; i < in_size; i += blockDim.y * gridDim.y) {
            const T *tmp = in[i];
            if (tmp)
                total[i] = tmp[col];
        }
        col += blockDim.x * gridDim.x;
    }

    while (col < N) {
        // reduction in y. final result in total[0]
        for (int s = 2; s > 0; s >>= 1) {
            if (row < s) {
                total[row] += total[s + row];
            }
        }
        if (row == 0)
            out[col] = total[0];
        col += blockDim.x * gridDim.x;
    }
}

int main()
{
    size_t fr_size = 200;
    float *hin = new float[fr_size];
    float *d_in0, *d_in1, *d_in2, *d_in3, *d_out;
    cudaMalloc((void**)&d_in0, fr_size * sizeof(float));
    cudaMalloc((void**)&d_in1, fr_size * sizeof(float));
    cudaMalloc((void**)&d_in2, fr_size * sizeof(float));
    cudaMalloc((void**)&d_in3, fr_size * sizeof(float));
    cudaMalloc((void**)&d_out, fr_size * sizeof(float));

    auto fillh = [&]() {
        for (int i = 0; i < fr_size; ++i) {
            hin[i] = 1.;
        }
    };
    auto filld = [&](void *mem, void *src, int size) {
        cudaMemcpy(mem, src, sizeof(float) * size, cudaMemcpyHostToDevice);
    };
    fillh();
    filld(d_in0, hin, fr_size);
    filld(d_in1, hin, fr_size);
    filld(d_in2, hin, fr_size);
    filld(d_in3, hin, fr_size);

    float *harray[4] = {d_in0, d_in1, d_in2, d_in3};
    float **darray;

    cudaMalloc((void**)&darray, 4 * sizeof(float*));
    cudaMemcpy(darray, harray, sizeof(float*) * 4, cudaMemcpyHostToDevice);

#define TILE_X 128
#define TILE_Y 2
#define CEIL(x, y) (((x) + (y) - 1) / (y))

    dim3 blocks = dim3(TILE_X, TILE_Y);
    dim3 grids = dim3(CEIL(fr_size, TILE_X), CEIL(4, TILE_Y));
    sum_gpu_array<float, TILE_Y><<<grids, blocks>>>(darray, d_out, fr_size, 4);
}

I’ve provided a response in your other thread that was asking essentially the same question (I think) with a possible proposal for how to realize one form of a multi-thread reduction per column.

[url]https://devtalk.nvidia.com/default/topic/1050216/cuda-programming-and-performance/how-to-add-pointer-array-value/post/5335141/#5335141[/url]
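
For readers who do not follow the link, here is a minimal sketch of one way a multi-thread reduction per column could look. It is not the code from the linked post. Two things stand out in the kernel posted above: total is an ordinary local array, so every thread has its own private copy and threads cannot see each other's partial sums, and the first while loop leaves col >= N, so the second while loop never runs. The sketch below keeps the partial sums in shared memory instead. The names column_sum, column_sums and partial are hypothetical, and the code assumes gridDim.y == 1, TILE_Y is a power of two, and all rows are non-empty and have the same length.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE_X = 128;  // columns handled per block
constexpr int TILE_Y = 2;    // threads cooperating on one column (power of two)

// Computes out[c] = sum over r of in[r][c] for `rows` row pointers of length N.
// Assumed launch shape: blockDim = (TILE_X, TILE_Y), gridDim.y == 1.
template <class T>
__global__ void column_sum(const T *const *in, T *out, int N, int rows) {
    __shared__ T partial[TILE_Y][TILE_X];  // shared by all threads of the block

    // The loop bound depends only on block-wide values, so every thread of a
    // block runs the same number of iterations and __syncthreads() is safe.
    for (int base = blockIdx.x * TILE_X; base < N; base += gridDim.x * TILE_X) {
        int col = base + threadIdx.x;

        // Each thread first accumulates its share of the rows in a register.
        T sum = 0;
        if (col < N) {
            for (int r = threadIdx.y; r < rows; r += TILE_Y)
                sum += in[r][col];
        }
        partial[threadIdx.y][threadIdx.x] = sum;
        __syncthreads();

        // Tree reduction across threadIdx.y in shared memory.
        for (int s = TILE_Y / 2; s > 0; s >>= 1) {
            if (threadIdx.y < s)
                partial[threadIdx.y][threadIdx.x] += partial[threadIdx.y + s][threadIdx.x];
            __syncthreads();
        }

        if (threadIdx.y == 0 && col < N)
            out[col] = partial[0][threadIdx.x];
    }
}

// Hypothetical host helper: uploads the rows, runs the kernel, returns the sums.
// Error checking is omitted to keep the sketch short.
std::vector<float> column_sums(const std::vector<std::vector<float>> &rows_h) {
    int rows = static_cast<int>(rows_h.size());
    int N = static_cast<int>(rows_h[0].size());

    std::vector<float *> row_ptrs(rows);
    for (int r = 0; r < rows; ++r) {
        cudaMalloc(&row_ptrs[r], N * sizeof(float));
        cudaMemcpy(row_ptrs[r], rows_h[r].data(), N * sizeof(float),
                   cudaMemcpyHostToDevice);
    }
    float **d_rows;  // device array holding the device row pointers
    cudaMalloc(&d_rows, rows * sizeof(float *));
    cudaMemcpy(d_rows, row_ptrs.data(), rows * sizeof(float *),
               cudaMemcpyHostToDevice);

    float *d_out;
    cudaMalloc(&d_out, N * sizeof(float));

    dim3 block(TILE_X, TILE_Y);
    dim3 grid((N + TILE_X - 1) / TILE_X);  // gridDim.y stays 1
    column_sum<float><<<grid, block>>>(d_rows, d_out, N, rows);

    std::vector<float> out(N);
    // The blocking copy on the default stream also waits for the kernel.
    cudaMemcpy(out.data(), d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    for (float *p : row_ptrs) cudaFree(p);
    cudaFree(d_rows);
    cudaFree(d_out);
    return out;
}
```

With the test data from the question (4 rows of 200 elements, all set to 1), calling column_sums on that input should give 200 sums, each equal to 4. For a matrix with only 4 rows, a single thread per column (TILE_Y = 1) may well be just as fast, since the shared-memory step only pays off when there are many rows to split between threads.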

Many thanks for your detailed explanation and brilliant code. I tried it on a V100, and the performance improved almost 3 times for my test data!