Personally I would prefer to use a while loop in the kernel code for simplicity. The simplest such implementation, assuming you are working on a matrix of 10,000 columns or more, is to launch one thread per column and have each thread loop through the elements of its column. Because adjacent threads read adjacent elements of each row, the loads will coalesce nicely.

```
const int threshold = 100;

template <typename T>
__global__ void threshold_kernel(const T * __restrict__ data, const int ncol, const int nrow, int * __restrict__ result){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;  // one thread per column
  if (idx < ncol){
    int my_result = -1;  // -1 means "no element in this column exceeds the threshold"
    int i = 0;
    // walk down the column, stopping at the first element above the threshold
    while ((i < nrow) && (my_result == -1)){
      if (data[i*ncol+idx] > threshold) my_result = i;
      i++;}
    result[idx] = my_result;}
}
```

(coded in browser, not tested)
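For completeness, a host-side launch for a kernel like this might look as follows (a sketch; `d_data` and `d_result` are assumed to be device allocations already populated, and error checking is omitted):

```
// hypothetical launch: one thread per column, rounding the grid up
// so every column gets a thread
int threads = 256;
int blocks = (ncol + threads - 1) / threads;
threshold_kernel<<<blocks, threads>>>(d_data, ncol, nrow, d_result);
cudaMemcpy(h_result, d_result, ncol*sizeof(int), cudaMemcpyDeviceToHost);
```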

Even if your matrix has only 1000 columns or so, I would still consider the above approach. The inefficiency associated with the smaller thread count might not be worth the additional complexity of other approaches.

If you have a small matrix, then you can do a parallel reduction per column, to increase the number of threads doing useful work.
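That idea could be sketched as follows: one block per column, with the threads of the block striding down the column and then performing a shared-memory min-reduction over the matching row indices. This is a sketch only; `BLOCK` is an assumed power-of-two block size, `threshold` is the same file-scope constant as before, and note that this access pattern is no longer coalesced, since adjacent threads in a block read elements `ncol` apart:

```
const int BLOCK = 256;  // threads per block, assumed to be a power of 2

template <typename T>
__global__ void threshold_reduce_kernel(const T * __restrict__ data, const int ncol, const int nrow, int * __restrict__ result){
  __shared__ int smin[BLOCK];
  int col = blockIdx.x;          // one block per column
  int my_min = nrow;             // nrow acts as "not found"
  // each thread scans a strided subset of the column; since i only
  // increases, the first match is this thread's minimum
  for (int i = threadIdx.x; i < nrow; i += BLOCK)
    if (data[i*ncol+col] > threshold){ my_min = i; break; }
  smin[threadIdx.x] = my_min;
  __syncthreads();
  // standard shared-memory min-reduction
  for (int s = BLOCK/2; s > 0; s >>= 1){
    if (threadIdx.x < s) smin[threadIdx.x] = min(smin[threadIdx.x], smin[threadIdx.x + s]);
    __syncthreads();
  }
  if (threadIdx.x == 0) result[col] = (smin[0] == nrow) ? -1 : smin[0];
}
```

You would launch it with one block per column, e.g. `threshold_reduce_kernel<<<ncol, BLOCK>>>(d_data, ncol, nrow, d_result);`.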