How to push back thread indices that pass a condition in a CUDA kernel, like numpy's `where` op?


For example, if I have an array like

condition=[
            [0,0,1,0,0],
            [1,1,0,0,0],
            [0,0,1,0,0],
            [0,0,0,1,0],
            [0,0,0,0,1]
          ]

then I want to get idx_x=[2, 0, 1, 2, 3, 4], idx_y = [0, 1, 1, 2, 3, 4]

I wrote a kernel, but it works incorrectly:

__global__ void nonzero_kernel(const float* input, float* output,
                               const int iH, const int iW) {
    const size_t gidx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (gidx >= (size_t)iH * iW) return;
    if (input[gidx] == 1.0f) {
        output[gidx]           = gidx % iW;  // x (column)
        output[gidx + iH * iW] = gidx / iW;  // y (row)
    } else {
        output[gidx]           = -1;  // x
        output[gidx + iH * iW] = -1;  // y
    }
}

As it stands, it returns dummy values (-1), like:

[[[-1. -1.  2. -1. -1.]
  [ 0.  1. -1. -1. -1.]
  [-1. -1.  2. -1. -1.]
  [-1. -1. -1.  3. -1.]
  [-1. -1. -1. -1.  4.]]

 [[-1. -1.  0. -1. -1.]
  [ 1.  1. -1. -1. -1.]
  [-1. -1.  2. -1. -1.]
  [-1. -1. -1.  3. -1.]
  [-1. -1. -1. -1.  4.]]]

To achieve my goal, I guess I need a global or shared index variable that is incremented whenever a thread pushes back its own index into the output vector, something like

__shared__ int out_index;

if (input[gidx] == 1.0f) {
    __syncthreads();
    output[out_index]     = gidx % iW;
    output[out_index + 1] = gidx / iW;
    out_index += 2;  // race condition: would need an atomic in practice
    __syncthreads();
}

but I suspect that would be very slow.

This falls into the category of problems solvable via parallel stream compaction.

A simplistic approach can be realized using an atomic push-back. While this method gives you all the correct locations, there is no guarantee of the ordering of those locations in the result vector(s).
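A minimal sketch of the atomic push-back, assuming separate `idx_x`/`idx_y` output arrays and a `counter` that has been zeroed before launch (all names here are illustrative, not from the original post):

```cuda
// Compaction via a global atomic counter. Each thread whose element
// matches reserves one slot with atomicAdd; output ORDER IS UNSPECIFIED.
__global__ void nonzero_atomic(const float* input, int* idx_x, int* idx_y,
                               int* counter, const int iH, const int iW) {
    const size_t gidx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (gidx >= (size_t)iH * iW) return;
    if (input[gidx] == 1.0f) {
        const int pos = atomicAdd(counter, 1);  // reserve an output slot
        idx_x[pos] = gidx % iW;  // column
        idx_y[pos] = gidx / iW;  // row
    }
}
```

After the kernel finishes, `*counter` holds the number of matches found.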

Instead, the general approach for parallel stream compaction involves, as suggested to you in your cross-posting, a parallel prefix sum, followed by an indexed copy.
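A sketch of that two-step approach (names and data layout are assumptions; Thrust is used here only for the scan):

```cuda
#include <thrust/scan.h>
#include <thrust/execution_policy.h>

// Step 1 (not shown): build flags[i] = 1 where input[i] == 1.0f, else 0.
// Step 2: an exclusive prefix sum of the flags assigns each surviving
//         element its output slot:
//   thrust::exclusive_scan(thrust::device, flags, flags + iH * iW, slots);
// Step 3: indexed copy -- each flagged thread writes to its own slot.
//         Unlike the atomic version, this preserves row-major order.
__global__ void scatter_kernel(const float* input, const int* slots,
                               int* idx_x, int* idx_y,
                               const int iH, const int iW) {
    const int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    if (gidx >= iH * iW) return;
    if (input[gidx] == 1.0f) {
        const int pos = slots[gidx];  // stable output position
        idx_x[pos] = gidx % iW;
        idx_y[pos] = gidx / iW;
    }
}
```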

Thrust can do this in a single call, using, for example, `thrust::copy_if`.
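For instance, one hedged sketch (helper names are mine, not Thrust's): copy the flat row-major indices of the matching entries with the stencil form of `thrust::copy_if`, then derive `x = idx % iW` and `y = idx / iW`.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Predicate applied to the stencil (the input values).
struct is_set {
    __host__ __device__ bool operator()(const float v) const {
        return v == 1.0f;
    }
};

// Copies the flat indices of matching elements into d_flat (which must be
// large enough) and returns how many were found, in row-major order.
int nonzero_flat(const thrust::device_vector<float>& d_input,
                 thrust::device_vector<int>& d_flat) {
    auto end = thrust::copy_if(thrust::counting_iterator<int>(0),
                               thrust::counting_iterator<int>((int)d_input.size()),
                               d_input.begin(),  // stencil
                               d_flat.begin(),
                               is_set());
    return (int)(end - d_flat.begin());
}
```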