Get first row index that meets the condition in cuda

kuszi · April 14, 2016, 8:57am

hi

below you can see a little pseudo code what i want to implement id cuda.
Get the first row index in every column where the value is bigger than 100
If its possible, i do not want to write “while” in my gpu code

myArray[4096x1024]
for(column 0 to 4095)
{
	row = 0;
	while (row++ < 1024 && myArray[row * 4096 + column] > 100)
	{
	}
	
	printf("row: %i", row);
}

is it possible, and how?

thx
charles

Robert_Crovella · April 14, 2016, 5:00pm

Personally I would prefer to use a while loop in the kernel code for simplicity. The simplest implementation like that, assuming you are working on a matrix of 10,000 columns or larger, is to have each thread loop through the elements in a given column. You launch one thread per column. This will coalesce nicely.

const int threshold = 100;
template <typename T>
__global__ void threshold_kernel(const T * __restrict__ data, const int ncol, const int nrow, int * __restrict__ result){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < ncol){
    int my_result = -1;
    int i = 0;
    while (i < nrow) && (my_result == -1){
      if (data[i*ncol+idx] > threshold) my_result = i;
      i++;}
    result[idx] = my_result;}
}

(coded in browser, not tested)

Even if your matrix is 1000 columns or larger, I would consider the above approach. The inefficiency associated with smaller thread count might not be improved given the additional complexity in other approaches.

If you have a small matrix, then you can do a parallel reduction per column, to attempt to increase the thread count.

kuszi · April 15, 2016, 11:26am

thx for the reply
currently Iam using almost the same code (except mine has much more complex condition)
and I was just thinking about are the any solution without loop, but I could not found

Robert_Crovella · April 15, 2016, 1:32pm

If you did a parallel reduction per column, you could eliminate the loop, but you would launch 1000 times as many threads, if your columns have 1000 elements in them.

I’m not going to write all that code out for you, however. If you want to learn how to write a parallel reduction, there is good material available, just do a google search on “mark harris parallel reduction”

And I’m not sure it would be any faster than launching your 4096 threads, each with a loop.

Topic		Replies	Views
Simulating going through array from top to bottom in parallel CUDA Programming and Performance	1	435	February 27, 2018
Finding the min and max index of values in a 1D array greater than particular threshold CUDA Programming and Performance	6	507	March 11, 2024
Find maximum values for each column of 2D array CUDA Programming and Performance	3	1052	May 5, 2022
Matrix help CUDA Programming and Performance	13	1596	September 28, 2010
Column wise reduction of column major matrix CUDA Programming and Performance	0	608	February 23, 2016
CUDA array filtering kernel without a for loop CUDA Programming and Performance cuda	1	453	May 26, 2021
how to replace a row in matrix by another according to a condition in CUDA CUDA Programming and Performance	2	544	February 20, 2017
How to push back thread index which pass a condition in cuda kernel like numpy's “Where” op? CUDA Programming and Performance	2	466	April 19, 2021
How to index last element of a row/column of an array selectively index specific elements of an arra CUDA Programming and Performance	5	5919	November 9, 2010
find maximum value in an array along with index CUDA Programming and Performance	40	19071	October 11, 2010

Get first row index that meets the condition in cuda

Related topics