I have the following problem: given a large matrix stored on the device, I have to find the minimum of each row. I need to do this in parallel.
One solution would be to issue several cublasIsamin calls from the cuBLAS library, one per row, using streams. But the number of rows is much larger than the number of streams I have at my disposal (they are constrained by other project details).
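To make the idea concrete, here is a minimal sketch of the stream-based approach I have in mind. It assumes a row-major float matrix; the names d_A, d_idx, rows, cols and nStreams are placeholders chosen for illustration, not part of my actual project. Note that cublasIsamin returns the 1-based index of the element with the smallest absolute value, so it only matches the true minimum when all entries are non-negative.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int rows = 8, cols = 5, nStreams = 2;  // placeholder sizes

    // Row-major host matrix with positive entries only.
    std::vector<float> h_A(rows * cols);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            h_A[r * cols + c] = 1.0f + (float)((r + c) % cols);

    float* d_A = nullptr;
    int* d_idx = nullptr;
    cudaMalloc(&d_A, h_A.size() * sizeof(float));
    cudaMalloc(&d_idx, rows * sizeof(int));
    cudaMemcpy(d_A, h_A.data(), h_A.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Results go to device memory, so a call need not wait on the host.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // One cublasIsamin call per row, spread round-robin over the streams.
    for (int r = 0; r < rows; ++r) {
        cublasSetStream(handle, streams[r % nStreams]);
        cublasIsamin(handle, cols, d_A + (size_t)r * cols, 1, d_idx + r);
    }
    cudaDeviceSynchronize();

    std::vector<int> h_idx(rows);
    cudaMemcpy(h_idx.data(), d_idx, rows * sizeof(int),
               cudaMemcpyDeviceToHost);
    for (int r = 0; r < rows; ++r) {
        int c = h_idx[r] - 1;  // cuBLAS indices are 1-based
        printf("row %d: min at column %d, value %g\n",
               r, c, h_A[r * cols + c]);
    }

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cublasDestroy(handle);
    cudaFree(d_idx);
    cudaFree(d_A);
    return 0;
}
```

With only a few streams and many rows, this degenerates into a long sequence of small calls, which is exactly my concern.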
How can I do this?
Of course I could reimplement the algorithm myself, but I don't like to reinvent the wheel, and my wheel would probably be less efficient than the one in an existing library.
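For reference, the hand-written version I would otherwise fall back on looks roughly like the sketch below: one thread block per row, with a shared-memory tree reduction. It assumes a row-major float matrix and a power-of-two block size; the kernel name rowMin and all sizes are hypothetical, and I have not tuned it.

```cuda
#include <cstdio>
#include <cfloat>
#include <vector>
#include <algorithm>
#include <cuda_runtime.h>

// Each block handles one row at a time (looping if rows > gridDim.x).
// Assumes blockDim.x is a power of two.
__global__ void rowMin(const float* A, int rows, int cols, float* out) {
    extern __shared__ float sdata[];
    for (int row = blockIdx.x; row < rows; row += gridDim.x) {
        // Each thread scans a strided slice of the row.
        float v = FLT_MAX;
        for (int c = threadIdx.x; c < cols; c += blockDim.x)
            v = fminf(v, A[(size_t)row * cols + c]);
        sdata[threadIdx.x] = v;
        __syncthreads();

        // Tree reduction in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                sdata[threadIdx.x] =
                    fminf(sdata[threadIdx.x], sdata[threadIdx.x + s]);
            __syncthreads();
        }
        if (threadIdx.x == 0) out[row] = sdata[0];
        __syncthreads();  // sdata is reused for the next row
    }
}

int main() {
    const int rows = 1000, cols = 300, threads = 128;  // placeholder sizes

    std::vector<float> h_A((size_t)rows * cols);
    for (size_t i = 0; i < h_A.size(); ++i)
        h_A[i] = (float)((i * 7919u) % 1000u) - 500.0f;

    float *d_A = nullptr, *d_out = nullptr;
    cudaMalloc(&d_A, h_A.size() * sizeof(float));
    cudaMalloc(&d_out, rows * sizeof(float));
    cudaMemcpy(d_A, h_A.data(), h_A.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    int blocks = std::min(rows, 4096);
    rowMin<<<blocks, threads, threads * sizeof(float)>>>(d_A, rows, cols, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::vector<float> h_out(rows);
    cudaMemcpy(h_out.data(), d_out, rows * sizeof(float),
               cudaMemcpyDeviceToHost);

    // Compare against a CPU reference.
    bool ok = true;
    for (int r = 0; r < rows; ++r) {
        float ref = *std::min_element(h_A.begin() + (size_t)r * cols,
                                      h_A.begin() + (size_t)(r + 1) * cols);
        if (ref != h_out[r]) ok = false;
    }
    printf("%s\n", ok ? "match" : "mismatch");

    cudaFree(d_out);
    cudaFree(d_A);
    return ok ? 0 : 1;
}
```

This handles all rows in a single launch without any streams, but I would still prefer a library routine if one exists.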