Does CUDA BLAS take advantage of multiple GPUs? Replace video card? Get more?


I’m putting together an algorithm whose memory requirements are dominated by a big matrix whose size is proportional to N^2, where N is the size of the input data. I’m using CUDA BLAS to do operations on this matrix.

My situation is such that the bigger I can make this matrix, the better my results are. I wasn’t able to find any documentation on just how CUDA BLAS handles multiple GPUs, or SLI-configured cards. I’m hoping someone can answer the following questions (or tell me where to look):

-Does CUDA BLAS recognize multiple cards and take advantage of them?

-What does CUDA BLAS do in the presence of two cards configured with SLI? Does it look like I have one big card with twice the memory and twice the computing power?

-If I can’t expand the functionality of CUDA BLAS past a single card, is there something else I can do to access optimized BLAS operations on multiple devices?


In general, CUDA requires that SLI be disabled in the software driver and that you manually divide up the work between cards if you have more than one. I have no experience with CUBLAS, but would not be surprised if it worked the same way.

Since your problem is limited by memory capacity, partitioning it will probably require decomposing your N x N matrix into sub-matrices in some way, and then splitting up your operations between the cards. This will get awkward if you have to do matrix-matrix multiplications, since you’ll want to shuffle data between cards, which means going through host memory.

Explicit device<->device memory copies between cards through the SLI bridge would be a neat feature for cases like this, but NVIDIA has never mentioned any such feature.

Well, one of the problems is that the CUDA BLAS guide doesn’t describe any library calls that manage which GPU to use if more than one is present, or how to send a command to one card and not another.

Since CUBLAS is implemented on top of CUDA, I would assume the multiple GPUs are handled the same way as general CUDA programs. You have to spawn a CPU thread for each GPU and call cudaSetDevice() in each thread to associate it with a particular card. Then all future CUDA calls in each thread will be associated with the appropriate GPU.
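Putting that together, the per-thread pattern would look roughly like this (a sketch against the original CUBLAS API using pthreads; untested, and the actual sub-matrix bookkeeping is left as comments):

```c
#include <pthread.h>
#include <cuda_runtime.h>
#include <cublas.h>

/* One worker per GPU: bind the thread to its device, after which every
   CUDA and CUBLAS call made from this thread targets that card. */
typedef struct {
    int device;
    /* ...plus pointers/offsets for this card's sub-matrices... */
} job_t;

static void *worker(void *arg)
{
    job_t *job = (job_t *)arg;
    cudaSetDevice(job->device); /* must precede any other CUDA call */
    cublasInit();
    /* cublasAlloc / cublasSetMatrix / cublasSgemm on this card's
       sub-matrices would go here, then copy the result strip back. */
    cublasShutdown();
    return NULL;
}

int main(void)
{
    int ngpu = 0;
    cudaGetDeviceCount(&ngpu);
    if (ngpu > 8)
        ngpu = 8;

    pthread_t threads[8];
    job_t jobs[8];
    for (int g = 0; g < ngpu; ++g) {
        jobs[g].device = g;
        pthread_create(&threads[g], NULL, worker, &jobs[g]);
    }
    for (int g = 0; g < ngpu; ++g)
        pthread_join(threads[g], NULL);
    return 0;
}
```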

(Yes, this is a huge pain. It would be nice if there were an optional way in CUDA to use multiple GPUs without having to manage host threads.)

The CUBLAS source is also now available, so you can see directly how it works: