How to push back thread indices that pass a condition in a CUDA kernel, like numpy's `where` op?


For example, if I have an array like

condition=[
            [0,0,1,0,0],
            [1,1,0,0,0],
            [0,0,1,0,0],
            [0,0,0,1,0],
            [0,0,0,0,1]
          ]

then I want to get idx_x=[2, 0, 1, 2, 3, 4], idx_y = [0, 1, 1, 2, 3, 4]

I wrote a kernel, but it works incorrectly:

__global__ void nonzero_kernel(const float* input, float* output,
                               const int iH, const int iW) {
    const size_t gidx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (gidx >= (size_t)iH * iW) return;
    if (input[gidx] == 1.0f) {
        output[gidx]           = gidx % iW;  // x (column)
        output[gidx + iH * iW] = gidx / iW;  // y (row)
    } else {
        output[gidx]           = -1;  // x
        output[gidx + iH * iW] = -1;  // y
    }
}

As it stands, it returns dummy values (-1), like:

[[[-1. -1.  2. -1. -1.]
  [ 0.  1. -1. -1. -1.]
  [-1. -1.  2. -1. -1.]
  [-1. -1. -1.  3. -1.]
  [-1. -1. -1. -1.  4.]]

 [[-1. -1.  0. -1. -1.]
  [ 1.  1. -1. -1. -1.]
  [-1. -1.  2. -1. -1.]
  [-1. -1. -1.  3. -1.]
  [-1. -1. -1. -1.  4.]]]

To achieve my goal, I guess I need a global or shared index variable that is incremented whenever a thread pushes back its own index into the output vector, something like

__shared__ int out_index;

if (input[gidx] == 1.0f) {
    __syncthreads();
    output[out_index]     = gidx % iW;
    output[out_index + 1] = gidx / iW;
    out_index += 2;  // race condition: would need an atomic in practice
    __syncthreads();
}

but I suspect that would be very slow.

This falls into the category of problems solvable via parallel stream compaction.

A simplistic approach can be realized using an atomic push-back. While this method gives you all the correct locations, there is no guarantee of the ordering of those locations in the result vector(s).
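A minimal sketch of the atomic push-back, assuming separate `idx_x`/`idx_y` output arrays and a `counter` that has been zeroed before launch (all names here are illustrative, not from the original post):

```cuda
// Compaction via a global atomic counter. Each thread whose element
// matches reserves one slot with atomicAdd; output ORDER IS UNSPECIFIED.
__global__ void nonzero_atomic(const float* input, int* idx_x, int* idx_y,
                               int* counter, const int iH, const int iW) {
    const size_t gidx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (gidx >= (size_t)iH * iW) return;
    if (input[gidx] == 1.0f) {
        const int pos = atomicAdd(counter, 1);  // reserve an output slot
        idx_x[pos] = gidx % iW;  // column
        idx_y[pos] = gidx / iW;  // row
    }
}
```

After the kernel finishes, `*counter` holds the number of matches found.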

Instead, the general approach for parallel stream compaction involves, as suggested to you in your cross-posting, a parallel prefix sum, followed by an indexed copy.
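A sketch of that two-step approach (names and data layout are assumptions; Thrust is used here only for the scan):

```cuda
#include <thrust/scan.h>
#include <thrust/execution_policy.h>

// Step 1 (not shown): build flags[i] = 1 where input[i] == 1.0f, else 0.
// Step 2: an exclusive prefix sum of the flags assigns each surviving
//         element its output slot:
//   thrust::exclusive_scan(thrust::device, flags, flags + iH * iW, slots);
// Step 3: indexed copy -- each flagged thread writes to its own slot.
//         Unlike the atomic version, this preserves row-major order.
__global__ void scatter_kernel(const float* input, const int* slots,
                               int* idx_x, int* idx_y,
                               const int iH, const int iW) {
    const int gidx = blockIdx.x * blockDim.x + threadIdx.x;
    if (gidx >= iH * iW) return;
    if (input[gidx] == 1.0f) {
        const int pos = slots[gidx];  // stable output position
        idx_x[pos] = gidx % iW;
        idx_y[pos] = gidx / iW;
    }
}
```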

Thrust can do this in a single call, using, for example, `thrust::copy_if`.
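For instance, one hedged sketch (helper names are mine, not Thrust's): copy the flat row-major indices of the matching entries with the stencil form of `thrust::copy_if`, then derive `x = idx % iW` and `y = idx / iW`.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>

// Predicate applied to the stencil (the input values).
struct is_set {
    __host__ __device__ bool operator()(const float v) const {
        return v == 1.0f;
    }
};

// Copies the flat indices of matching elements into d_flat (which must be
// large enough) and returns how many were found, in row-major order.
int nonzero_flat(const thrust::device_vector<float>& d_input,
                 thrust::device_vector<int>& d_flat) {
    auto end = thrust::copy_if(thrust::counting_iterator<int>(0),
                               thrust::counting_iterator<int>((int)d_input.size()),
                               d_input.begin(),  // stencil
                               d_flat.begin(),
                               is_set());
    return (int)(end - d_flat.begin());
}
```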