Ax=v for x, with A being a 2D square matrix and v a 1D vector.
Typical size of a matrix: 9x9
Typical number of matrices: >=1e6
Current code using cuBLAS:
- create handle
- cudaMalloc double *A: batchSize*N*N*sizeof(double)
- fill double **
vptrs with addresses of each small 9x9 matrix stored in A and v
setAandV<<<numBlocks, numThreads>>>(A, v)
- use results stored in
- destroy: free memory and destroy handle
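The pointer-table step in the list above can be sketched as follows. This is only an illustration under assumptions, not the original code: it assumes the matrices sit back to back in A (N*N doubles each) and the vectors back to back in v (N doubles each), and `fillPointerTables`, `Aptrs` and `vptrs` are hypothetical names. It is written as plain C++ so it can be checked on the host; in CUDA the loop body would become a kernel with `i` derived from `blockIdx.x * blockDim.x + threadIdx.x`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int N = 9;  // matrix dimension quoted in the post

// Build per-system pointer tables from two contiguous buffers.
// A: batchSize matrices of N*N doubles stored back to back (assumed layout).
// v: batchSize vectors of N doubles stored back to back (assumed layout).
// Aptrs / vptrs: hypothetical names for the double** tables.
void fillPointerTables(double* A, double* v,
                       std::vector<double*>& Aptrs,
                       std::vector<double*>& vptrs,
                       std::size_t batchSize) {
    assert(A != nullptr && v != nullptr);
    Aptrs.resize(batchSize);
    vptrs.resize(batchSize);
    for (std::size_t i = 0; i < batchSize; ++i) {
        Aptrs[i] = A + i * N * N;  // start of matrix i
        vptrs[i] = v + i * N;      // start of vector i
    }
}
```

Because every address is just an offset into one large allocation, the tables only need to be rebuilt when A or v is reallocated, not each time the values change.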
It turns out setting the values in
v is costly, possibly due to the low speed of the global memory where they are stored.
getrf also seems nontrivial, but I guess there is nothing I can do about it.
Any suggestions to improve the strategy?
- Can we use malloc or new in the thread and store their address in
Aptr to be used in
- Can we iterate over all cells and call cuSolver’s getrf and getrs for each cell? I feel this would be slow.
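
To make the per-cell idea in the last question concrete, here is a rough sketch of a different technique from the library calls discussed above: a hand-written Gaussian elimination with partial pivoting that one thread could run on its own 9x9 system. It is not the cuBLAS or cuSolver API. `solveSmallSystem`, `kDim` and `HOST_DEVICE` are hypothetical names, column-major storage is assumed, and whether this is faster than the batched cuBLAS route has not been measured here.

```cpp
#include <cassert>
#include <cmath>

// Expands to CUDA qualifiers under nvcc and to nothing for a host compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

constexpr int kDim = 9;  // system size quoted in the post

// Solve A x = b in place for one small dense system.
// A: kDim*kDim doubles, column-major, element (r, c) at A[r + c * kDim].
// b: right-hand side on entry, solution x on return.
// Returns false when an exactly zero pivot column is found (singular matrix).
HOST_DEVICE inline bool solveSmallSystem(double* A, double* b) {
    assert(A != nullptr && b != nullptr);
    for (int k = 0; k < kDim; ++k) {
        // Partial pivoting: find the largest |A(r, k)| for r >= k.
        int p = k;
        double best = std::fabs(A[k + k * kDim]);
        for (int r = k + 1; r < kDim; ++r) {
            double cand = std::fabs(A[r + k * kDim]);
            if (cand > best) { best = cand; p = r; }
        }
        if (best == 0.0) return false;
        if (p != k) {
            // Swap rows k and p; columns left of k are no longer read.
            for (int c = k; c < kDim; ++c) {
                double t = A[k + c * kDim];
                A[k + c * kDim] = A[p + c * kDim];
                A[p + c * kDim] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }
        // Eliminate column k from the rows below the pivot.
        for (int r = k + 1; r < kDim; ++r) {
            double f = A[r + k * kDim] / A[k + k * kDim];
            for (int c = k + 1; c < kDim; ++c)
                A[r + c * kDim] -= f * A[k + c * kDim];
            b[r] -= f * b[k];
        }
    }
    // Back substitution on the upper triangle.
    for (int r = kDim - 1; r >= 0; --r) {
        double s = b[r];
        for (int c = r + 1; c < kDim; ++c) s -= A[r + c * kDim] * b[c];
        b[r] = s / A[r + r * kDim];
    }
    return true;
}
```

In a kernel, each thread would call `solveSmallSystem` on its own matrix and vector, addressed by offset into A and v, so no per-thread malloc or new would be needed for this variant.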