I have a question about CUDA functions such as cublasDgemm, cusolverDnDgetrf, and cusolverDnDgetrs. The documentation says that the user has no control over the number of threads or the amount of memory dedicated to these functions.
Now in my case I have relatively small matrices to operate on, from 100x100 to 400x400. However, I have multiple problems to solve simultaneously: let's say I have to solve a set of systems Ax = b, where A and b can be computed with some matrix multiplications. I would like to solve this system N times in parallel (since each instance involves only relatively small matrices).
The problem is that I cannot tell these functions how much hardware they may take control of, and I do not know how to split the hardware among them.
One approach would be to use the batched functions. There are batched gemm functions in cuBLAS as well as batched potrf functions in cuSOLVER. cuBLAS also has batched getrf/getri/getrs functionality.
As you’ve already stated, you won’t be able to tell any cuBLAS or cuSOLVER function how much hardware to use, or how to split the hardware.
At matrix sizes of 400x400, you’re probably better off using the non-batched functions. To some extent the recommendation depends on the GPU you are running on; a 400x400 matrix should come pretty close to saturating a V100 GPU.
After a brief search on Google I only found an example in Fortran, but I am using C++.
I have an NVIDIA RTX A3000 Laptop GPU.
Can you provide me with a link for learning these batched functions?
Thank you very much for your help.
potrf batched: If you go to the documentation for the function, it links to this example.
cublas gemm batched: it actually has its own blog article, but here is another example.
cublas getri/getrf: take a look at Stack Overflow; there are a number of examples of batched getri/getrf usage in C++. The CUDA sample codes batchCUBLAS and simpleCUBLAS_LU also demonstrate it.