Efficient way to initialize a dense matrix in shared memory from a sparse matrix (COO)

Hi all,

I want to use sparse matrix data (e.g., in COO format) to initialize a dense matrix in shared memory.
Profiling shows that this step is the major performance bottleneck of my CUDA program:

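__shared__ float dense_A[H * W];

// Each thread handles every threadPerBlock-th non-zero entry.
for (int i = tid; i < non_zeros; i += threadPerBlock) {
    int col = colList[i];
    int row = rowList[i];
    dense_A[row * W + col] = 1.0f;  // scattered write into shared memory
}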
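
For context, here is a minimal self-contained sketch of the pattern above, in case anyone wants to reproduce it. The tile size (H = W = 32), the kernel name scatter_coo_to_dense, the host helper run_scatter, and the out buffer are assumptions made for illustration only and are not taken from my real program; blockDim.x stands in for threadPerBlock. The sketch also zeroes the tile before the scatter, since shared memory is not initialized by default.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed tile size for illustration; the real H and W are not stated above.
constexpr int H = 32;
constexpr int W = 32;

// Hypothetical kernel: one block builds one dense H x W tile from COO indices.
__global__ void scatter_coo_to_dense(const int* rowList, const int* colList,
                                     int non_zeros, float* out)
{
    __shared__ float dense_A[H * W];
    const int tid = threadIdx.x;

    // Zero the tile first; shared memory starts with undefined contents.
    for (int i = tid; i < H * W; i += blockDim.x)
        dense_A[i] = 0.0f;
    __syncthreads();

    // Scatter: each thread handles every blockDim.x-th non-zero entry.
    for (int i = tid; i < non_zeros; i += blockDim.x)
        dense_A[rowList[i] * W + colList[i]] = 1.0f;
    __syncthreads();

    // Copy the tile out so the host can inspect it (only needed for this sketch).
    for (int i = tid; i < H * W; i += blockDim.x)
        out[i] = dense_A[i];
}

// Hypothetical host helper: runs one block and returns the dense tile.
std::vector<float> run_scatter(const std::vector<int>& rows,
                               const std::vector<int>& cols)
{
    const int nnz = static_cast<int>(rows.size());
    int* d_rows = nullptr;
    int* d_cols = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_rows, nnz * sizeof(int));
    cudaMalloc(&d_cols, nnz * sizeof(int));
    cudaMalloc(&d_out, H * W * sizeof(float));
    cudaMemcpy(d_rows, rows.data(), nnz * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cols, cols.data(), nnz * sizeof(int), cudaMemcpyHostToDevice);

    scatter_coo_to_dense<<<1, 128>>>(d_rows, d_cols, nnz, d_out);

    std::vector<float> dense(H * W);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(dense.data(), d_out, H * W * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_rows);
    cudaFree(d_cols);
    cudaFree(d_out);
    return dense;
}
```

Calling run_scatter with rows {0, 1, 31} and cols {0, 5, 31} should give a tile that is 1.0f at (0,0), (1,5), and (31,31) and 0.0f elsewhere.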

Are there any suggestions for improving its performance?
Thanks a lot!