I want to copy all the non-zero elements of a 1-D array into an output array, and I am trying to do this at the warp level, like this:
- each warp uses __ballot_sync to find out how many non-zero elements its lanes hold.
- __syncthreads()
- each warp finds its start and end positions in the output array and writes its elements there.
My question is: within a warp, how do I map lane_id to an index in the output array? I'd appreciate any advice.
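To make the question concrete, here is a minimal sketch of one common way to get that mapping: count the set bits of the ballot that belong to lower-numbered lanes. It assumes a 1-D block, a warp size of 32, and that all 32 lanes of the warp take part in the vote. The names warp_rank, compact_one_warp and compact_one_warp_host are hypothetical helpers made up for this illustration, and the demo only handles a single warp.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (illustration only): returns how many lanes below this
// one have a true predicate, i.e. this lane's slot in the warp's compacted
// output. Every lane of the warp must call it.
__device__ int warp_rank(bool keep, int *warp_total)
{
    const unsigned full_mask = 0xffffffffu;           // assumes all 32 lanes are active
    const unsigned lane_id   = threadIdx.x & 31u;     // assumes a 1-D block
    const unsigned ballot    = __ballot_sync(full_mask, keep);
    const unsigned lower     = (1u << lane_id) - 1u;  // bits of lanes 0 .. lane_id-1
    *warp_total = __popc(ballot);                     // non-zero elements in this warp
    return __popc(ballot & lower);                    // exclusive prefix count
}

// Demo kernel for a single warp: launch with <<<1, 32>>> and n <= 32.
__global__ void compact_one_warp(const int *in, int n, int *out, int *out_count)
{
    const int lane_id = threadIdx.x;
    const int value = (lane_id < n) ? in[lane_id] : 0;  // pad the tail with zeros
    int total = 0;
    const int rank = warp_rank(value != 0, &total);
    if (value != 0) out[rank] = value;                  // input order is preserved
    if (lane_id == 0) *out_count = total;
}

// Hypothetical host wrapper; error checking is omitted for brevity.
std::vector<int> compact_one_warp_host(const std::vector<int> &h_in)
{
    const int n = static_cast<int>(h_in.size());        // must be <= 32
    int *d_in, *d_out, *d_count;
    cudaMalloc(&d_in, 32 * sizeof(int));
    cudaMalloc(&d_out, 32 * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    compact_one_warp<<<1, 32>>>(d_in, n, d_out, d_count);
    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> h_out(count);
    cudaMemcpy(h_out.data(), d_out, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);
    return h_out;
}
```

With several warps, the same rank would be added to a per-warp start position, which still has to come from somewhere (a scan over the warp totals, or an atomic counter).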
I already found a related post that uses the Thrust library's copy_if function, but how can this be implemented in plain CUDA?
I also found the code below, but it works at the block level, and I think having every thread do its own atomicAdd on idx might not be the best choice. This blog post shows that it can be done with just one thread per warp adding n, instead of n threads each adding one.
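    __global__ void gpu_Xn(int *pHist, int pnN, int *pXn)
    {
        // As written, this is only valid for a single block of at most
        // 256 threads, with one element of pHist per thread.
        int Tid = threadIdx.x;
        __shared__ int tmpXn[256];  // staging buffer for indices of non-zero elements
        __shared__ int idx;         // next free slot in tmpXn
        tmpXn[Tid] = -1;            // -1 marks an unused slot
        if (Tid == 0) idx = 0;
        __syncthreads();
        if (pHist[Tid] != 0)
        {
            // One atomicAdd per non-zero element; slot order is not deterministic.
            int x = atomicAdd(&idx, 1);
            tmpXn[x] = Tid;
        }
        __syncthreads();
        // Copy the first pnN slots of the staging buffer to the output.
        if (Tid < pnN)
            pXn[Tid] = tmpXn[Tid];
    }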
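
For comparison, the "one thread per warp adds n" idea could look roughly like the sketch below. This is my own reading of the technique, not code taken from the blog: lane 0 reserves space for the whole warp with a single atomicAdd on a global counter, broadcasts the start position with __shfl_sync, and each lane adds its rank within the warp. It assumes a 1-D launch whose block size is a multiple of 32, and it copies the values rather than the indices. The names compact_nonzero and compact_nonzero_host are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sketch of stream compaction with warp-aggregated atomics.
// Assumes blockDim.x is a multiple of 32 so every warp is full.
__global__ void compact_nonzero(const int *in, int n, int *out, int *out_count)
{
    const unsigned full_mask = 0xffffffffu;
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned lane_id = threadIdx.x & 31u;

    // No early return: every lane must reach the warp-level intrinsics.
    const int value = (gid < n) ? in[gid] : 0;
    const bool keep = (value != 0);

    const unsigned ballot = __ballot_sync(full_mask, keep);
    const int total = __popc(ballot);                          // non-zeros in this warp
    const int rank  = __popc(ballot & ((1u << lane_id) - 1u)); // non-zeros in lower lanes

    int base = 0;
    if (lane_id == 0 && total > 0)
        base = atomicAdd(out_count, total);   // one atomic per warp instead of one per thread
    base = __shfl_sync(full_mask, base, 0);   // broadcast the warp's start position

    if (keep) out[base + rank] = value;
}

// Hypothetical host wrapper; error checking is omitted for brevity.
std::vector<int> compact_nonzero_host(const std::vector<int> &h_in)
{
    const int n = static_cast<int>(h_in.size());
    if (n == 0) return {};
    int *d_in, *d_out, *d_count;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    const int threads = 256;                  // multiple of 32
    const int blocks = (n + threads - 1) / threads;
    compact_nonzero<<<blocks, threads>>>(d_in, n, d_out, d_count);

    int count = 0;
    cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    std::vector<int> h_out(count);
    cudaMemcpy(h_out.data(), d_out, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);
    return h_out;
}
```

Elements keep their relative order inside a warp, but the order of the warps in the output depends on which warp's atomicAdd lands first, so the result is not a stable compaction. If stable order matters, the per-warp start positions would have to come from a prefix sum over the warp totals instead of an atomic counter.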