The above program is not correct. Each iteration depends on the value of id left by the previous iteration, because of the reference to src[id]. I don’t see a way to parallelize the algorithm as written, because the result depends on the order in which the values of i and j are visited.
What are you trying to do? To take advantage of CUDA (or any parallel method) you are going to have to change your algorithm.
Your CUDA program above is not correct, but I think it can be parallelized.
I mean that not all of your algorithm can be parallelized. You can split it into 2 steps.
First step: find the indices of all “!=0” elements in each column and store them in a 2D array. (parallel algorithm)
Second step: calculate the “Id” variable from those indices. (serial algorithm)
I don’t think that will be faster than the original serial algorithm.
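The two steps above could be sketched roughly as follows. This is only a guess at the shape of the problem, since the original program is not shown here: it assumes a row-major int image, one thread per column, and that src[id] is meant to hold the linear position of each nonzero element in column order. The names findNonZeroPerColumn, collectNonZero, colIdx and colCount are made up for the example, and error checking is left out to keep it short.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Step 1 (parallel): one thread per column j records the row indices of the
// nonzero elements of that column. Column j's list starts at colIdx[j * rows].
__global__ void findNonZeroPerColumn(const int* img, int rows, int cols,
                                     int* colIdx, int* colCount)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= cols) return;

    int n = 0;
    for (int i = 0; i < rows; ++i) {
        if (img[i * cols + j] != 0) {
            colIdx[j * rows + n] = i;
            ++n;
        }
    }
    colCount[j] = n;
}

// Runs step 1 on the GPU, then step 2 (serial) on the host.
// Assumption: src[id] receives the linear index i * cols + j of each nonzero
// element, visited column by column.
std::vector<int> collectNonZero(const std::vector<int>& img, int rows, int cols)
{
    const size_t n = static_cast<size_t>(rows) * cols;
    if (n == 0) return {};

    int *dImg = nullptr, *dIdx = nullptr, *dCount = nullptr;
    cudaMalloc(&dImg, n * sizeof(int));
    cudaMalloc(&dIdx, n * sizeof(int));
    cudaMalloc(&dCount, cols * sizeof(int));
    cudaMemcpy(dImg, img.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (cols + threads - 1) / threads;
    findNonZeroPerColumn<<<blocks, threads>>>(dImg, rows, cols, dIdx, dCount);

    // The blocking copies below also wait for the kernel to finish.
    std::vector<int> idx(n), count(cols);
    cudaMemcpy(idx.data(), dIdx, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(count.data(), dCount, cols * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dImg);
    cudaFree(dIdx);
    cudaFree(dCount);

    // Step 2 (serial): id only advances in column order, so this loop stays
    // sequential.
    std::vector<int> src(n);
    int id = 0;
    for (int j = 0; j < cols; ++j) {
        for (int k = 0; k < count[j]; ++k) {
            src[id] = idx[j * rows + k] * cols + j;
            ++id;
        }
    }
    src.resize(id);
    return src;
}
```

With a 2x3 image {0,5,0, 7,0,9}, the nonzero elements are visited as column 0 (row 1), column 1 (row 0), column 2 (row 1). The copies to and from the device plus the serial pass are the reason this may not beat the plain serial loop.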