CUDA C: finding the row indices of the non-zero elements in each matrix column in parallel

How can I implement the following in parallel in CUDA C? Given a matrix, I want to find the non-zero elements of each column and store their row indices in an array belonging to that column. For example, for a 512 * 512 matrix, I need to find the non-zero elements of each of the 512 columns and write the row coordinates of those elements into the array for the corresponding column. The number of non-zero elements in each column is not known in advance, so the size of each array cannot be fixed beforehand.
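
One possible way to deal with the unknown per-column sizes is a two-pass scheme: first count the non-zeros in each column, turn the counts into start offsets with a prefix sum, then fill one packed index array. The following is only a minimal sketch under assumptions that are not in the question: the matrix is square, stored row-major as `float`, one thread handles one column, and the names `countNonZeros`, `fillRowIndices` and `extractColumnRowIndices` are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Pass 1: one thread per column counts the non-zero entries of that column.
// A is n x n and row-major (assumption), so adjacent threads read adjacent addresses.
__global__ void countNonZeros(const float *A, int n, int *counts)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;
    int c = 0;
    for (int row = 0; row < n; ++row)
        if (A[row * n + col] != 0.0f) ++c;
    counts[col] = c;
}

// Pass 2: each thread writes its column's row indices starting at offsets[col].
__global__ void fillRowIndices(const float *A, int n, const int *offsets, int *rowIdx)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= n) return;
    int pos = offsets[col];
    for (int row = 0; row < n; ++row)
        if (A[row * n + col] != 0.0f) rowIdx[pos++] = row;
}

// Hypothetical host driver. Fills hOffsets[0..n] and returns a malloc'd array of
// hOffsets[n] row indices; column c owns the range [hOffsets[c], hOffsets[c+1]).
int *extractColumnRowIndices(const float *hA, int n, int *hOffsets)
{
    const int threads = 128, blocks = (n + threads - 1) / threads;
    size_t matBytes = (size_t)n * n * sizeof(float);
    float *dA; int *dCounts, *dOffsets, *dRowIdx;
    cudaMalloc((void **)&dA, matBytes);
    cudaMalloc((void **)&dCounts, n * sizeof(int));
    cudaMalloc((void **)&dOffsets, n * sizeof(int));
    cudaMemcpy(dA, hA, matBytes, cudaMemcpyHostToDevice);

    countNonZeros<<<blocks, threads>>>(dA, n, dCounts);
    int *hCounts = (int *)malloc(n * sizeof(int));
    cudaMemcpy(hCounts, dCounts, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Exclusive prefix sum on the host; n is small enough here for that to be cheap.
    hOffsets[0] = 0;
    for (int c = 0; c < n; ++c) hOffsets[c + 1] = hOffsets[c] + hCounts[c];
    int total = hOffsets[n];

    // Allocate at least one element so an all-zero matrix is still handled.
    size_t outBytes = (size_t)(total > 0 ? total : 1) * sizeof(int);
    cudaMalloc((void **)&dRowIdx, outBytes);
    cudaMemcpy(dOffsets, hOffsets, n * sizeof(int), cudaMemcpyHostToDevice);
    fillRowIndices<<<blocks, threads>>>(dA, n, dOffsets, dRowIdx);

    int *hRowIdx = (int *)malloc(outBytes);
    cudaError_t err = cudaMemcpy(hRowIdx, dRowIdx, (size_t)total * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(1);
    }
    free(hCounts);
    cudaFree(dA); cudaFree(dCounts); cudaFree(dOffsets); cudaFree(dRowIdx);
    return hRowIdx;
}

int main()
{
    const int n = 512;
    float *A = (float *)malloc((size_t)n * n * sizeof(float));
    // Arbitrary test pattern, chosen only so that columns have non-zeros to find.
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            A[r * n + c] = ((r * 7 + c * 3) % 5 == 0) ? 1.0f : 0.0f;

    int *offsets = (int *)malloc((n + 1) * sizeof(int));
    int *rowIdx = extractColumnRowIndices(A, n, offsets);

    // Check the GPU result against a straightforward CPU walk of each column.
    int mismatches = 0;
    for (int c = 0; c < n; ++c) {
        int pos = offsets[c];
        for (int r = 0; r < n; ++r)
            if (A[r * n + c] != 0.0f && rowIdx[pos++] != r) ++mismatches;
        if (pos != offsets[c + 1]) ++mismatches;
    }
    printf("total non-zeros: %d, mismatches: %d\n", offsets[n], mismatches);

    free(A); free(offsets); free(rowIdx);
    return mismatches == 0 ? 0 : 1;
}
```

The row indices of column `c` are then `rowIdx[offsets[c]]` up to `rowIdx[offsets[c + 1] - 1]`, which is essentially the index part of a compressed sparse column layout. For larger matrices the host prefix sum could be replaced with a device-side scan, and several threads could cooperate on one column; both are left out here to keep the sketch short.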