I’m putting together an algorithm whose memory requirements are dominated by a big matrix with size proportional to N^2, where N is the size of the input data. I’m using CUDA BLAS to do operations on this matrix.
My situation is such that the bigger I can make this matrix, the better my results are. I wasn’t able to find any documentation on just how CUDA BLAS handles multiple GPUs, or SLI-configured cards. I’m hoping someone can answer the following questions (or tell me where to look):
-Does CUDA BLAS recognize multiple cards and take advantage of them?
-What does CUDA BLAS do in the presence of two cards configured with SLI? Does it look like I have one big card with twice the memory and twice the computing power?
-If I can’t expand the functionality of CUDA BLAS past a single card, is there something else I can do to access optimized BLAS operations on multiple devices?