Sparse matrix creation in parallel

Are there any papers on algorithms for efficiently creating (filling the elements of) a compressed sparse row (CSR) matrix in parallel?

Since the matrix elements are computed in parallel on the GPU, I would like to avoid copying the data from the GPU to the host, building the sparse matrix on the host, and copying it back to the GPU to solve the system.
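
To make the question concrete, here is a minimal sketch of the kind of construction I have in mind: the common count / prefix-sum / scatter pattern for turning unordered (row, col, value) triplets into CSR. It is written as plain sequential C++ so the logic is easy to check; the comments note how each phase could be mapped to a GPU, but that mapping is my assumption, not a tested implementation. The names `Entry`, `Csr` and `build_csr` are hypothetical and do not come from any library. The sketch does not sort columns within a row and does not merge duplicate entries.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical triplet: one computed matrix entry in coordinate (COO) form.
struct Entry { int row; int col; double val; };

// Hypothetical CSR container (illustrative names only).
struct Csr {
    std::vector<int> row_ptr;    // size n_rows + 1
    std::vector<int> col_idx;    // size nnz
    std::vector<double> values;  // size nnz
};

Csr build_csr(int n_rows, const std::vector<Entry>& entries) {
    Csr m;
    m.row_ptr.assign(static_cast<std::size_t>(n_rows) + 1, 0);

    // Phase 1: count the entries in each row, stored one slot to the right.
    // (Assumed GPU mapping: one thread per entry with an atomic increment.)
    for (const Entry& e : entries) {
        m.row_ptr[e.row + 1]++;
    }

    // Phase 2: running sum turns the counts into row start offsets.
    // (Assumed GPU mapping: a parallel scan.)
    for (int r = 0; r < n_rows; ++r) {
        m.row_ptr[r + 1] += m.row_ptr[r];
    }

    // Phase 3: scatter each entry into the next free slot of its row.
    // (Assumed GPU mapping: one thread per entry, atomic fetch-and-add on next[row].)
    std::vector<int> next(m.row_ptr.begin(), m.row_ptr.end() - 1);
    m.col_idx.resize(entries.size());
    m.values.resize(entries.size());
    for (const Entry& e : entries) {
        int dst = next[e.row]++;
        m.col_idx[dst] = e.col;
        m.values[dst] = e.val;
    }
    return m;
}
```

With the scatter done in parallel, the order of entries inside a row would depend on thread scheduling, so a per-row sort may still be needed if the solver expects sorted column indices.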

I have the same problem. Filling a huge sparse matrix on the host takes forever.