Hi all,
Is there a fast way to reduce several matrices into one?
For example, I start with 6 MxN matrices and want to reduce them into a single matrix by summing the elements at each position (x, y), so that the output is one MxN matrix.
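To make the intended operation concrete, here is a minimal host-side sketch of it. It is plain C++ that also compiles unchanged inside a .cu file. The function name `sum_matrices_host` and the back-to-back storage of the input matrices are assumptions for illustration (the layout matches `[nbits, mat_size]` as used for `bit_mat`). It is meant only as a reference to check GPU results against, not as a fast implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference: `stacked` holds `count` matrices back to back,
// each flattened to `mat_size` = M * N elements (layout [count, mat_size]).
// Returns one flattened matrix whose element i is the sum over all inputs.
std::vector<unsigned> sum_matrices_host(const std::vector<unsigned>& stacked,
                                        std::size_t count,
                                        std::size_t mat_size) {
    std::vector<unsigned> out(mat_size, 0u);
    for (std::size_t k = 0; k < count; ++k) {
        for (std::size_t i = 0; i < mat_size; ++i) {
            out[i] += stacked[k * mat_size + i];
        }
    }
    return out;
}
```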
My current implementation is slow:
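// mat_size = M * N;
// blocksize = 1024;
// kernel dim: <<<(mat_size + blocksize - 1) / blocksize, blocksize>>>  (rounded up so a partial last block is covered)
// bit_mat: [nbits, mat_size]
// s32_mat: [mat_size], must be zeroed before launch because the kernel accumulates with +=
// nbits = 6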
__global__
void bit_compostion(unsigned* bit_mat, unsigned* s32_mat, const unsigned nbits, const unsigned mat_size){
    const unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < mat_size){
        #pragma unroll
        for (unsigned cur_bit = 0; cur_bit < nbits; cur_bit++){
            const unsigned source_id = cur_bit * mat_size + tid;
            const unsigned target_id = tid;
            s32_mat[target_id] += bit_mat[source_id];
        }
    }
}
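One direction that might help, as an unmeasured sketch rather than a confirmed fix: keep the running sum in a register and store it once. In the kernel above, `s32_mat[target_id] +=` sits inside the loop and the two pointers are not marked as non-aliasing, so the compiler may have to reload and rewrite global memory on every iteration. `nbits` is also a runtime value, so `#pragma unroll` may not fully unroll the loop. The names `sum_planes` and `launch_sum_planes`, the template parameter `NBITS`, and the block size of 256 are assumptions for illustration. Unlike the original, this variant overwrites `s32_mat` instead of accumulating into it.

```cuda
#include <cuda_runtime.h>

// Hypothetical variant of bit_compostion. NBITS is a compile-time constant so
// the inner loop can be fully unrolled; __restrict__ promises that the input
// and output buffers do not overlap.
template <unsigned NBITS>
__global__ void sum_planes(const unsigned* __restrict__ bit_mat,
                           unsigned* __restrict__ s32_mat,
                           unsigned mat_size) {
    const unsigned stride = gridDim.x * blockDim.x;
    // Grid-stride loop: covers every element for any grid size, including
    // when mat_size is not a multiple of the block size.
    for (unsigned i = blockIdx.x * blockDim.x + threadIdx.x; i < mat_size; i += stride) {
        unsigned acc = 0;  // running sum stays in a register
        #pragma unroll
        for (unsigned b = 0; b < NBITS; ++b) {
            // Within each plane, consecutive threads read consecutive addresses.
            acc += bit_mat[b * mat_size + i];
        }
        s32_mat[i] = acc;  // single store; overwrites, so no pre-zeroing is needed
    }
}

// Hypothetical launch helper for the nbits = 6 case from the question.
// 256 threads per block is an untuned starting point, not a recommendation.
inline void launch_sum_planes(const unsigned* d_bit_mat, unsigned* d_s32_mat,
                              unsigned mat_size) {
    if (mat_size == 0) return;  // a grid of 0 blocks is an invalid launch
    const unsigned block = 256;
    const unsigned grid = (mat_size + block - 1) / block;  // round up
    sum_planes<6><<<grid, block>>>(d_bit_mat, d_s32_mat, mat_size);
}
```

Usage would be `launch_sum_planes(d_bit_mat, d_s32_mat, mat_size);` with both arguments being device pointers. Each output element still needs 6 reads and 1 write with almost no arithmetic, so the kernel is probably limited by memory bandwidth. It may be worth comparing the measured time against the time needed to move 7 * mat_size * 4 bytes before optimizing further.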
Thanks!