I have tens of matrices, each on the order of 1000 x 1000, to decompose. cuSOLVER's batched SVD handles matrices up to 32 x 32 only. I am thinking about launching separate SVD routines on different streams. Since I am using an A100, I think it should be possible to let each SM do one matrix. Is that possible? What do I need to do for that to happen (perhaps setting launch bounds)?
There isn’t anything you can do with cusolver to limit a particular op to one SM, or to directly restrict its GPU resource footprint in any way.
If your question is about cusolver, please ask it on the libraries forum.
Dense matrices of 1000x1000 may give good utilization of currently available GPUs without the need for further parallelization.
May I ask if it is possible to use multiple CUDA streams in combination with cuSOLVER to run the batched SVDs in parallel?
It should be possible. Whether it would provide any performance benefit would probably depend on a number of factors, such as the GPU in question and the problem sizes involved. I don’t have any performance guidance beyond that, and I’m not aware of any sample codes already set up to demonstrate SVDs in parallel. If you want to try it, the cuSOLVER library samples may be of interest.
My comments mostly apply to running a few “medium size” SVDs (non-batched) in parallel (and I don’t know whether there would be any benefit). It’s not clear why you would need to parallelize batched SVDs if the problem sizes are consistent; just increase the batch size.
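In rough outline, the streams approach might look like the following. This is a minimal sketch under stated assumptions, not a tested sample: it runs one non-batched double-precision gesvd per stream, with one cusolverDn handle per stream, assumes square matrices already resident on the device in column-major order, and uses placeholder names such as NUM_MATRICES; error handling and data transfer are stripped to the bare minimum.

```cpp
// Sketch: one non-batched double-precision SVD per CUDA stream,
// one cusolverDn handle per stream.
#include <vector>
#include <cuda_runtime.h>
#include <cusolverDn.h>

#define CUDA_CHECK(x)     do { if ((x) != cudaSuccess)             return 1; } while (0)
#define CUSOLVER_CHECK(x) do { if ((x) != CUSOLVER_STATUS_SUCCESS) return 1; } while (0)

int main() {
    const int NUM_MATRICES = 16;              // placeholder: "10s of matrices"
    const int m = 1000, n = 1000, lda = m;    // square, so gesvd's m >= n holds

    std::vector<cudaStream_t>       streams(NUM_MATRICES);
    std::vector<cusolverDnHandle_t> handles(NUM_MATRICES);
    std::vector<double*> d_A(NUM_MATRICES), d_S(NUM_MATRICES),
                         d_U(NUM_MATRICES), d_VT(NUM_MATRICES),
                         d_work(NUM_MATRICES);
    std::vector<int*>    d_info(NUM_MATRICES);
    int lwork = 0;

    for (int i = 0; i < NUM_MATRICES; ++i) {
        CUDA_CHECK(cudaStreamCreate(&streams[i]));
        CUSOLVER_CHECK(cusolverDnCreate(&handles[i]));
        // Bind each handle to its own stream so the gesvd calls can be
        // issued concurrently instead of serializing on the default stream.
        CUSOLVER_CHECK(cusolverDnSetStream(handles[i], streams[i]));

        CUDA_CHECK(cudaMalloc(&d_A[i],    sizeof(double) * lda * n));
        CUDA_CHECK(cudaMalloc(&d_S[i],    sizeof(double) * n));
        CUDA_CHECK(cudaMalloc(&d_U[i],    sizeof(double) * m * m));
        CUDA_CHECK(cudaMalloc(&d_VT[i],   sizeof(double) * n * n));
        CUDA_CHECK(cudaMalloc(&d_info[i], sizeof(int)));
        // ... fill d_A[i] with the i-th matrix (omitted) ...

        CUSOLVER_CHECK(cusolverDnDgesvd_bufferSize(handles[i], m, n, &lwork));
        CUDA_CHECK(cudaMalloc(&d_work[i], sizeof(double) * lwork));
    }

    // Enqueue one full SVD per stream.
    for (int i = 0; i < NUM_MATRICES; ++i) {
        CUSOLVER_CHECK(cusolverDnDgesvd(handles[i], 'A', 'A', m, n,
                                        d_A[i], lda, d_S[i],
                                        d_U[i], m, d_VT[i], n,
                                        d_work[i], lwork,
                                        /*rwork=*/nullptr, d_info[i]));
    }
    CUDA_CHECK(cudaDeviceSynchronize());

    // ... inspect d_info, then destroy handles/streams and free buffers ...
    return 0;
}
```

Whether the calls actually overlap depends on how much of the GPU a single 1000 x 1000 gesvd already occupies; a profiler timeline (e.g. Nsight Systems) would show whether there is any concurrency left to exploit.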
Would it be possible for you to share the application details of your batched SVD use case with us? E.g., project or area (research/enterprise), typical matrix sizes, and number of matrices?
Hi, thanks for the reply. My areas of research are quantum many-body physics and tensor networks (related to cuTensorNet and cuQuantum, but I am not using them). A typical workload is, for a batch of a few thousand medium-size matrices (e.g., batch size 1024), to compute the batched SVD (or eigh of AA^T). Example matrix sizes and average times for the batched SVD in PyTorch, with the gesvda and gesvdj drivers:
Matrix size     gesvda avg time (s)   gesvdj avg time (s)
(8, 16)         0.006456              0.003317
(18, 54)        0.008967              0.927196
(32, 128)       0.012992              1.332342
(50, 250)       0.180654              2.592348
(72, 432)       0.279291              4.537686
(98, 686)       0.470683              7.076421
(128, 1024)     0.693446              6.798602
(162, 1458)     1.363434              n/a
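For reference, the smallest case above, (8, 16), fits within the 32 x 32 limit of cuSOLVER's gesvdjBatched path mentioned at the top of the thread. Below is a minimal sketch of that call, not a tested sample: it assumes double precision, column-major storage, a contiguously packed batch already on the device, and uses placeholder names such as BATCH.

```cpp
// Sketch: batched Jacobi SVD of BATCH small (m x n, both <= 32) matrices
// with cusolverDnDgesvdjBatched.
#include <algorithm>
#include <cuda_runtime.h>
#include <cusolverDn.h>

#define CUDA_CHECK(x)     do { if ((x) != cudaSuccess)             return 1; } while (0)
#define CUSOLVER_CHECK(x) do { if ((x) != CUSOLVER_STATUS_SUCCESS) return 1; } while (0)

int main() {
    const int BATCH = 1024;                    // placeholder batch size
    const int m = 8, n = 16;                   // the (8, 16) case from the table
    const int lda = m, ldu = m, ldv = n;
    const int minmn = std::min(m, n);

    cusolverDnHandle_t handle;
    gesvdjInfo_t params;
    CUSOLVER_CHECK(cusolverDnCreate(&handle));
    CUSOLVER_CHECK(cusolverDnCreateGesvdjInfo(&params));

    double *d_A, *d_S, *d_U, *d_V, *d_work;
    int *d_info;
    CUDA_CHECK(cudaMalloc(&d_A,    sizeof(double) * lda * n * BATCH));
    CUDA_CHECK(cudaMalloc(&d_S,    sizeof(double) * minmn * BATCH));
    CUDA_CHECK(cudaMalloc(&d_U,    sizeof(double) * ldu * m * BATCH));
    CUDA_CHECK(cudaMalloc(&d_V,    sizeof(double) * ldv * n * BATCH));
    CUDA_CHECK(cudaMalloc(&d_info, sizeof(int) * BATCH));
    // ... copy the packed batch of matrices into d_A (omitted) ...

    int lwork = 0;
    CUSOLVER_CHECK(cusolverDnDgesvdjBatched_bufferSize(
        handle, CUSOLVER_EIG_MODE_VECTOR, m, n,
        d_A, lda, d_S, d_U, ldu, d_V, ldv,
        &lwork, params, BATCH));
    CUDA_CHECK(cudaMalloc(&d_work, sizeof(double) * lwork));

    // One call computes the SVD of all BATCH matrices.
    CUSOLVER_CHECK(cusolverDnDgesvdjBatched(
        handle, CUSOLVER_EIG_MODE_VECTOR, m, n,
        d_A, lda, d_S, d_U, ldu, d_V, ldv,
        d_work, lwork, d_info, params, BATCH));
    CUDA_CHECK(cudaDeviceSynchronize());

    // ... check d_info per matrix, then free buffers and destroy
    //     the gesvdj params and the handle (omitted) ...
    return 0;
}
```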