Hi all,
I want to initialize a dense matrix in shared memory from sparse matrix information (e.g., COO row and column index lists).
This step turns out to be the major performance bottleneck of my CUDA program:
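__shared__ float dense_A[H * W];
// Each thread handles every threadPerBlock-th non-zero entry.
for (int i = tid; i < non_zeros; i += threadPerBlock) {
    int col = colList[i];
    int row = rowList[i];
    dense_A[row * W + col] = 1.0f;  // mark the non-zero position
}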
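In case more context helps, here is a minimal self-contained sketch of the same pattern. It is not my exact code: the tile size (H = W = 16), the kernel name cooToDense, the host helper denseFromCoo, and the out buffer are placeholders made up for illustration. The zeroing pass and the __syncthreads() calls are what I assume the surrounding code needs, since shared memory is not zero-initialized. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical tile size; H * W floats must fit in shared memory.
constexpr int H = 16;
constexpr int W = 16;

__global__ void cooToDense(const int* rowList, const int* colList,
                           int non_zeros, float* out)
{
    __shared__ float dense_A[H * W];
    const int tid = threadIdx.x;

    // Zero the tile cooperatively; shared memory starts uninitialized.
    for (int i = tid; i < H * W; i += blockDim.x)
        dense_A[i] = 0.0f;
    __syncthreads();

    // Scatter the non-zeros; consecutive threads read consecutive COO entries.
    for (int i = tid; i < non_zeros; i += blockDim.x) {
        const int row = rowList[i];
        const int col = colList[i];
        dense_A[row * W + col] = 1.0f;
    }
    __syncthreads();

    // Copy the tile to global memory so the host can inspect it
    // (illustration only; a real kernel would consume dense_A directly).
    for (int i = tid; i < H * W; i += blockDim.x)
        out[i] = dense_A[i];
}

// Hypothetical host helper: runs one block and returns the dense tile.
std::vector<float> denseFromCoo(const std::vector<int>& rows,
                                const std::vector<int>& cols)
{
    const int nnz = static_cast<int>(rows.size());
    int* dRows = nullptr;
    int* dCols = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dRows, nnz * sizeof(int));
    cudaMalloc(&dCols, nnz * sizeof(int));
    cudaMalloc(&dOut, H * W * sizeof(float));
    cudaMemcpy(dRows, rows.data(), nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dCols, cols.data(), nnz * sizeof(int), cudaMemcpyHostToDevice);

    cooToDense<<<1, 128>>>(dRows, dCols, nnz, dOut);

    std::vector<float> dense(H * W);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(dense.data(), dOut, H * W * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dRows);
    cudaFree(dCols);
    cudaFree(dOut);
    return dense;
}
```

Calling denseFromCoo({0, 1, 3}, {0, 2, 1}) should give a 16 x 16 tile with ones at (0,0), (1,2), and (3,1) and zeros elsewhere. Duplicate COO entries are harmless here because every writer stores the same value.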
Are there any suggestions for improving its performance?
Thanks a lot!