How to efficiently copy the non-zero elements of an array to another array

I want to copy all the non-zero elements of a 1-D array to an output array. I am trying to do this at warp level, like this:

  1. Each warp uses __ballot_sync to find out how many non-zero elements it holds.
  2. Call __syncthreads().
  3. Each warp finds its start and end positions and writes its elements to the output array.

So my question is: how do I map lane_id to an index in the output array within the warp? I'd appreciate any advice.
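For reference, the mapping I have seen suggested is to count the set ballot bits below the current lane. Here is a minimal sketch of that idea, not a tested implementation. `warp_rank` is a hypothetical helper name, and it assumes 1-D blocks whose size is a multiple of 32, with all 32 lanes of the warp calling it together:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not part of any library).
// Returns this lane's slot among the lanes of its warp that keep an element,
// and stores the warp's total number of kept elements in *warp_count.
// Assumption: every lane of the warp calls this (full 0xffffffff mask).
__device__ int warp_rank(bool keep, int *warp_count)
{
    unsigned lane = threadIdx.x & 31;                  // lane id within the warp
    unsigned mask = __ballot_sync(0xffffffffu, keep);  // bit i is set if lane i keeps its element
    *warp_count = __popc(mask);                        // kept elements in the whole warp
    return __popc(mask & ((1u << lane) - 1u));         // kept lanes with a lower lane id
}
```

If the warp's start position in the output is `base`, a lane with `keep == true` would then write to `out[base + rank]`, so the kept elements of one warp land contiguously and in lane order.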

I already found a related post that uses the Thrust library's copy_if function, but how can I implement this in plain CUDA?

I also found the code below, but it works at the block level, and I think having every thread call atomicAdd on idx might not be the best choice. This blog shows that it can be done with just one thread per warp adding n, instead of n threads each adding one.

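__global__ void gpu_Xn(int *pHist, int pnN, int* pXn) 
{
    // Only threadIdx.x is used, so this assumes a single block of at most 256 threads.
    int Tid = threadIdx.x ;

    __shared__ int tmpXn[256] ;   // compacted thread indices, -1 marks an unused slot
    __shared__ int idx ;          // next free slot in tmpXn

    tmpXn[Tid] = -1 ;
    if(Tid == 0) idx = 0  ;

    __syncthreads() ;

    if(pHist[Tid] !=0)
    {
        // One shared-memory atomic per non-zero element; slot order is not deterministic.
        int x = atomicAdd(&idx, 1) ; 
        tmpXn[x] = Tid ;
    }

    __syncthreads() ;
    if(Tid < pnN)
        pXn[Tid] = tmpXn[Tid] ;
}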
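For comparison, here is a rough sketch of the "one thread per warp adds n" idea applied to copying the non-zero values themselves. The names `compact_nonzero` and `out_count` are hypothetical. It assumes 1-D blocks whose size is a multiple of 32 and an `out_count` that is zeroed before launch. It is a sketch under those assumptions, not a tuned implementation:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: copy the non-zero elements of in[0..n) to out, one atomicAdd per warp.
// out_count must be 0 before launch; afterwards it holds the number of copied elements.
__global__ void compact_nonzero(const int *in, int n, int *out, int *out_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned lane = threadIdx.x & 31;

    // Out-of-range threads still join the ballot, with keep == false.
    int v = (i < n) ? in[i] : 0;
    bool keep = (v != 0);

    unsigned mask = __ballot_sync(0xffffffffu, keep);
    int rank = __popc(mask & ((1u << lane) - 1u));   // kept lanes below this one

    // Lane 0 reserves space for the whole warp, then broadcasts the start position.
    int base = 0;
    if (lane == 0 && mask != 0)
        base = atomicAdd(out_count, __popc(mask));
    base = __shfl_sync(0xffffffffu, base, 0);

    if (keep)
        out[base + rank] = v;
}

int main()
{
    const int n = 8;
    int h_in[n] = {0, 3, 0, 0, 7, 1, 0, 9};
    int h_out[n];
    int h_count = 0;

    int *d_in, *d_out, *d_count;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    compact_nonzero<<<(n + 255) / 256, 256>>>(d_in, n, d_out, d_count);

    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_out, d_out, h_count * sizeof(int), cudaMemcpyDeviceToHost);

    for (int k = 0; k < h_count; ++k)
        printf("%d ", h_out[k]);
    printf("\n");

    cudaFree(d_in);
    cudaFree(d_out);
    cudaFree(d_count);
    return 0;
}
```

Elements from the same warp stay in their original order, but warps reserve their ranges in whatever order their atomicAdd calls happen to run, so the overall output order is not guaranteed to match the input. If a stable order is needed, a prefix sum over the per-warp counts would have to replace the atomic counter.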