Find multiple minima in parallel

I have the following problem: given a large matrix stored on the device, I have to find the minimum of each row, and I need to do this in parallel.
One solution would be to issue several cublasIsamin calls from the cuBLAS library, using streams. But the number of rows is much bigger than the number of streams available to me (they are constrained by some other project details).
How can I do this?
Of course I could reimplement the algorithm myself, but I don't like to reinvent the wheel, and my wheel would probably be less efficient than the one in an existing library.

CUB and Thrust can both do segmented reductions. You can find examples in various forum posts: here is one of a Thrust segmented reduction, and here is another.
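
To make the idea concrete, here is a minimal sketch of a Thrust segmented reduction that computes one minimum per row with a single `thrust::reduce_by_key` call. It is not taken from the linked posts. It assumes a row-major `float` matrix already in a `thrust::device_vector`, and that `rows * cols` fits in an `int`. The names `row_minima`, `RowIndex`, and `MinOp` are made up for this illustration. Note that it returns the minimum values, not the index of the smallest-magnitude element that cublasIsamin reports; getting indices would need a reduction over (value, index) pairs.

```cuda
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/reduce.h>

// Hypothetical helper: maps a linear element index to its row index.
// Assumes row-major storage with `cols` elements per row.
struct RowIndex {
    int cols;
    __host__ __device__ int operator()(int i) const { return i / cols; }
};

// Hypothetical helper: binary minimum used as the reduction operator.
struct MinOp {
    __host__ __device__ float operator()(float a, float b) const {
        return a < b ? a : b;
    }
};

// Returns one minimum per row of a rows x cols row-major matrix.
// Assumption: rows * cols fits in an int.
thrust::device_vector<float> row_minima(const thrust::device_vector<float>& matrix,
                                        int rows, int cols)
{
    thrust::device_vector<float> minima(rows);

    // Keys are generated on the fly: element i belongs to segment i / cols,
    // so no key array has to be stored in device memory.
    auto keys = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), RowIndex{cols});

    // Consecutive elements with equal keys (one row) are reduced together.
    // The output keys are not needed, so they go to a discard iterator.
    thrust::reduce_by_key(keys, keys + rows * cols,
                          matrix.begin(),
                          thrust::make_discard_iterator(),
                          minima.begin(),
                          thrust::equal_to<int>(),
                          MinOp());
    return minima;
}
```

Calling `row_minima(d_matrix, rows, cols)` launches the whole reduction on one stream, so the number of rows is no longer tied to the number of streams.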
