Solving multiple relatively small problems at once

Hello everyone.

I have a question about using CUDA library functions such as cublasDgemm, cusolverDnDgetrf, and cusolverDnDgetrs. The documentation says that the user has no control over the number of threads or the amount of memory these functions use.

In my case I have relatively small matrices to operate on, ranging from 100x100 to 400x400. However, I have multiple problems to solve simultaneously. Say I have to solve a set of systems Ax = b, where A and b are computed with a few matrix multiplications. I would like to solve N such systems in parallel, since each one involves only relatively small matrices.

The problem is that I cannot tell these functions how much of the hardware they may use, and I do not know how to split the hardware among them.

One approach would be to use the batched functions. There are batched gemm functions in cuBLAS as well as batched potrf functions in cuSOLVER. cuBLAS also has batched getrf/getri/getrs functionality.
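cuBLAS offers batched GEMM in two forms: cublasDgemmBatched takes arrays of device pointers, while cublasDgemmStridedBatched takes one buffer per operand with the matrices stored at a fixed stride. As a minimal sketch, assuming column-major storage and no transposes, the plain C++ function below spells out what a strided batched GEMM computes. It is a CPU reference, not cuBLAS code, and the name gemm_strided_batched_ref is hypothetical; it can be handy for checking GPU results on small test cases.

```cpp
#include <cassert>

// CPU reference (not cuBLAS) for a strided batched DGEMM, no transposes:
//   for each b in [0, batch): C_b = alpha * A_b * B_b + beta * C_b
// Matrices are column-major; matrix b starts at offset b * stride in its buffer.
// Parameter order loosely mirrors cublasDgemmStridedBatched without the handle
// and transpose flags; alpha and beta are passed by value here.
void gemm_strided_batched_ref(int m, int n, int k, double alpha,
                              const double* A, int lda, long long strideA,
                              const double* B, int ldb, long long strideB,
                              double beta,
                              double* C, int ldc, long long strideC,
                              int batch) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int b = 0; b < batch; ++b) {
        const double* Ab = A + b * strideA;
        const double* Bb = B + b * strideB;
        double* Cb = C + b * strideC;
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                double acc = 0.0;
                for (int p = 0; p < k; ++p)
                    acc += Ab[i + p * lda] * Bb[p + j * ldb];
                // As in BLAS, beta == 0 means C is not read (it may hold garbage).
                double& c = Cb[i + j * ldc];
                c = (beta == 0.0) ? alpha * acc : alpha * acc + beta * c;
            }
        }
    }
}
```

For N systems of size n x n stored back to back, the strides would all be n*n, and on the GPU the whole set of products would then be a single cuBLAS call instead of N separate ones.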

As you’ve already stated, you can’t tell any cuBLAS or cuSOLVER function how much hardware to use or how to split it.

For 400x400 matrices you’re probably better off using the non-batched functions, although to some extent the recommendation depends on the GPU you are running on. A single 400x400 problem should be able to come pretty close to saturating a V100 GPU.

Thank you,

After a brief Google search I only found an example in Fortran, but I am using C++.

I have an NVIDIA RTX A3000 Laptop GPU.

Can you point me to a link for learning these batched functions?

Thank you very much for your help.

This example should help: CUDALibrarySamples/cuBLAS/Level-3/gemmBatched in the NVIDIA/CUDALibrarySamples repository on GitHub.

potrf batched: If you go to the documentation for the function, it links to this example.
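For symmetric positive definite systems, cusolverDnDpotrfBatched factors every matrix in a batch as A = L L^T. The sketch below is a CPU reference only, assuming lower-triangular storage, an array of matrix pointers, and a per-matrix info array with the LAPACK-style meaning; the function name potrf_lower_batched_ref is hypothetical and this is not cuSOLVER's implementation.

```cpp
#include <cassert>
#include <cmath>

// CPU reference (not cuSOLVER) for a batched lower Cholesky factorization.
// Each A[b] is an n x n column-major SPD matrix with leading dimension lda.
// On success the lower triangle of A[b] is overwritten by L (A = L * L^T) and
// info[b] = 0. If the leading minor of order k is not positive definite,
// info[b] = k (LAPACK-style) and that matrix is left partially factored.
// The strict upper triangle is never referenced.
void potrf_lower_batched_ref(int n, double* const* A, int lda, int* info, int batch) {
    assert(lda >= n);
    for (int b = 0; b < batch; ++b) {
        double* a = A[b];
        info[b] = 0;
        for (int j = 0; j < n; ++j) {
            // Diagonal entry: L(j,j) = sqrt(A(j,j) - sum_p L(j,p)^2).
            double d = a[j + j * lda];
            for (int p = 0; p < j; ++p) d -= a[j + p * lda] * a[j + p * lda];
            if (!(d > 0.0)) { info[b] = j + 1; break; }
            double ljj = std::sqrt(d);
            a[j + j * lda] = ljj;
            // Below the diagonal: L(i,j) = (A(i,j) - sum_p L(i,p) L(j,p)) / L(j,j).
            for (int i = j + 1; i < n; ++i) {
                double s = a[i + j * lda];
                for (int p = 0; p < j; ++p) s -= a[i + p * lda] * a[j + p * lda];
                a[i + j * lda] = s / ljj;
            }
        }
    }
}
```

A failed factorization in one matrix does not stop the others, which is why the info value is kept per matrix.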

cuBLAS gemm batched: it actually has its own blog article, but here is another example.

cuBLAS getri/getrf: take a look at Stack Overflow, which has a number of C++ examples of batched getri/getrf usage. The CUDA sample codes batchCUBLAS and simpleCUBLAS_LU also demonstrate batched usage.
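For general systems like the Ax = b in the question, the batched route in cuBLAS is cublasDgetrfBatched followed by cublasDgetrsBatched. The CPU reference below sketches what that pair computes, assuming column-major n x n matrices passed as pointer arrays, a contiguous pivot array with n entries per matrix holding 1-based row indices, and the no-transpose solve. The names getrf_batched_ref and getrs_batched_ref are hypothetical; this is plain unblocked LU with partial pivoting, not the cuBLAS implementation.

```cpp
#include <cassert>
#include <cmath>

// CPU reference (not cuBLAS): LU with partial pivoting, P * A_b = L_b * U_b,
// for each n x n column-major matrix A[b]. L (unit diagonal) and U overwrite A[b].
// ipiv holds n entries per matrix, 1-based: row j was swapped with row ipiv[b*n+j]-1.
// info[b] = 0 on success, or k if U(k,k) is exactly zero (1-based, LAPACK-style).
void getrf_batched_ref(int n, double* const* A, int lda, int* ipiv, int* info, int batch) {
    assert(lda >= n);
    for (int b = 0; b < batch; ++b) {
        double* a = A[b];
        int* piv = ipiv + b * n;
        info[b] = 0;
        for (int j = 0; j < n; ++j) {
            // Pick the largest-magnitude entry in column j at or below the diagonal.
            int p = j;
            for (int i = j + 1; i < n; ++i)
                if (std::fabs(a[i + j * lda]) > std::fabs(a[p + j * lda])) p = i;
            piv[j] = p + 1;
            if (a[p + j * lda] == 0.0) {  // zero pivot: record it and move on
                if (info[b] == 0) info[b] = j + 1;
                continue;
            }
            if (p != j)  // swap rows j and p across all columns
                for (int c = 0; c < n; ++c) {
                    double t = a[j + c * lda]; a[j + c * lda] = a[p + c * lda]; a[p + c * lda] = t;
                }
            for (int i = j + 1; i < n; ++i) a[i + j * lda] /= a[j + j * lda];  // multipliers of L
            for (int c = j + 1; c < n; ++c)  // rank-1 update of the trailing submatrix
                for (int i = j + 1; i < n; ++i) a[i + c * lda] -= a[i + j * lda] * a[j + c * lda];
        }
    }
}

// Solves A_b * X_b = B_b in place in B[b] (nrhs column-major right-hand sides),
// using the factors and pivots produced by getrf_batched_ref.
void getrs_batched_ref(int n, int nrhs, const double* const* A, int lda, const int* ipiv,
                       double* const* B, int ldb, int batch) {
    assert(lda >= n && ldb >= n);
    for (int b = 0; b < batch; ++b) {
        const double* a = A[b];
        const int* piv = ipiv + b * n;
        for (int r = 0; r < nrhs; ++r) {
            double* x = B[b] + r * ldb;
            for (int j = 0; j < n; ++j) {  // apply the row interchanges P
                int p = piv[j] - 1;
                if (p != j) { double t = x[j]; x[j] = x[p]; x[p] = t; }
            }
            for (int j = 0; j < n; ++j)  // forward solve with unit-lower L
                for (int i = j + 1; i < n; ++i) x[i] -= a[i + j * lda] * x[j];
            for (int j = n - 1; j >= 0; --j) {  // back solve with upper U
                x[j] /= a[j + j * lda];
                for (int i = 0; i < j; ++i) x[i] -= a[i + j * lda] * x[j];
            }
        }
    }
}
```

Each A_b is factored once, so several right-hand sides (nrhs > 1) can reuse the same factors; cublasDgetrsBatched likewise takes an nrhs argument.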
