Get the first row index that meets a condition in CUDA


Below you can see a little pseudo code of what I want to implement in CUDA.
Get the first row index in every column where the value is bigger than 100.
If possible, I do not want to write a "while" loop in my GPU code.

for (column = 0; column < 4096; column++) {
	row = 0;
	while (row < 1024 && myArray[row * 4096 + column] <= 100)
		row++;
	printf("row: %i\n", row);
}

Is it possible, and how?


Personally I would prefer to use a while loop in the kernel code, for simplicity. The simplest such implementation, assuming you are working on a matrix of 10,000 columns or larger, is to have each thread loop through the elements of a given column, launching one thread per column. The loads will coalesce nicely, since adjacent threads read adjacent elements of each row.

const int threshold = 100;
template <typename T>
__global__ void threshold_kernel(const T * __restrict__ data, const int ncol, const int nrow, int * __restrict__ result){

  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < ncol){
    int my_result = -1;   // -1 means no element in this column exceeded the threshold
    int i = 0;
    while ((i < nrow) && (my_result == -1)){
      if (data[i*ncol+idx] > threshold) my_result = i;
      i++;}
    result[idx] = my_result;}
}

(coded in browser, not tested)
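For completeness, a host-side launch for this kernel might look like the following sketch (also untested; `d_data` and `d_result` are assumed to be device pointers you have already allocated with `cudaMalloc` and populated with `cudaMemcpy`):

```cuda
// Host-side launch sketch: one thread per column, rounded up to whole blocks
const int nrow = 1024, ncol = 4096;        // sizes taken from the question
int threads = 256;
int blocks  = (ncol + threads - 1) / threads;
threshold_kernel<float><<<blocks, threads>>>(d_data, ncol, nrow, d_result);
cudaDeviceSynchronize();                   // wait before copying result back
```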

Even if your matrix is 1000 columns or larger, I would consider the above approach. The inefficiency associated with smaller thread count might not be improved given the additional complexity in other approaches.

If you have a small matrix, then you can do a parallel reduction per column, to attempt to increase the thread count.

Thanks for the reply.
I am currently using almost the same code (except mine has a much more complex condition),
and I was just wondering whether there is any solution without a loop, but I could not find one.

If you did a parallel reduction per column, you could eliminate the loop, but you would launch 1000 times as many threads, if your columns have 1000 elements in them.

I’m not going to write all that code out for you, however. If you want to learn how to write a parallel reduction, there is good material available, just do a google search on “mark harris parallel reduction”

And I’m not sure it would be any faster than launching your 4096 threads, each with a loop.
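If the goal is simply to avoid writing an explicit loop in the kernel (rather than to do a classic reduction), one further alternative is to launch one thread per element and let each qualifying thread record its row index with atomicMin. This is a sketch (untested); `result` must be initialized to a sentinel at least as large as `nrow` (e.g. `INT_MAX`) before the launch, and the threshold is passed as a parameter here rather than taken from the global constant above:

```cuda
// Sketch (untested): one thread per element, no per-thread loop.
// result[col] still equals the sentinel after the kernel if no element
// in that column exceeded the threshold.
template <typename T>
__global__ void threshold_atomic_kernel(const T * __restrict__ data,
                                        const int ncol, const int nrow,
                                        const T threshold,
                                        int * __restrict__ result){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;   // linear element index
  if (idx < nrow * ncol){
    int row = idx / ncol;
    int col = idx % ncol;
    if (data[idx] > threshold)
      atomicMin(&result[col], row);   // smallest qualifying row index wins
  }
}
```

Note this launches nrow times as many threads, and contended atomics on the same column can serialize, so it is not necessarily faster than the looping version either.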