Fast elementwise reduction of multiple matrices on GPU

Hi all,

Is there a fast way to reduce several matrices into one?

For example, I start with 6 MxN matrices and want to reduce them into a single matrix by adding the elements at the same position (x, y) across all of them. The output is one MxN matrix.
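To make the expected result concrete, here is a minimal host-side sketch of the reduction. The name reduce_reference is only an illustrative helper, and it assumes the matrices are stored back to back as [nbits, mat_size] with mat_size = M * N, which is the layout my kernel below uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: sums nbits matrices, stored back to back in
// bit_mat, into one output matrix of mat_size = M * N elements.
std::vector<unsigned> reduce_reference(const std::vector<unsigned>& bit_mat,
                                       unsigned nbits, unsigned mat_size) {
    // The input must hold exactly nbits matrices of mat_size elements each.
    assert(bit_mat.size() == static_cast<std::size_t>(nbits) * mat_size);

    std::vector<unsigned> out(mat_size, 0u);
    for (unsigned k = 0; k < nbits; ++k) {
        for (unsigned i = 0; i < mat_size; ++i) {
            // Element i of matrix k is added to element i of the output.
            out[i] += bit_mat[static_cast<std::size_t>(k) * mat_size + i];
        }
    }
    return out;
}
```

A GPU version should produce the same output as this loop, so it can serve as a correctness check when comparing faster kernels.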

My current implementation is slow:

// mat_size = M * N;
// blocksize = 1024;
// kernel dim: <<< (mat_size + blocksize - 1) / blocksize, blocksize >>>
//   (rounded up so the tail is covered when mat_size is not a multiple of blocksize)
// bit_mat:  [nbits, mat_size]
// s32_mat:  [mat_size], accumulated with +=, so it must be zeroed before launch
// nbits = 6
__global__
void bit_compostion(unsigned* bit_mat, unsigned* s32_mat, const unsigned nbits, const unsigned mat_size){
    // One thread per output element.
    const unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < mat_size){
        #pragma unroll
        for (unsigned cur_bit = 0; cur_bit < nbits; cur_bit++){
            const unsigned source_id = cur_bit * mat_size + tid;
            const unsigned target_id = tid;

            // Read-modify-write on global memory once per input matrix.
            s32_mat[target_id] += bit_mat[source_id];
        }
    }
}

Thanks!