Hello
Since CUDA 4.0 will improve communication between GPUs, will CUBLAS be rewritten to take advantage of multi-GPU computing? For example, multiplication of very large matrices could benefit greatly from dividing the input across 2 GPUs, having each GPU multiply its share, and then merging the partial results back together.
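To make the idea concrete, here is a minimal CPU-only sketch of the split I have in mind. It is plain Python, not CUBLAS code: the names `matmul` and `split_matmul` are made up for illustration, and the inner `matmul` call only stands in for whatever per-GPU GEMM call each device would run. It assumes the left matrix is divided by rows and that each device has a full copy of the right matrix.

```python
def matmul(a, b):
    """Plain product of two row-major list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def split_matmul(a, b, parts=2):
    """Compute a * b by splitting the rows of `a` into `parts` blocks.

    Each block is multiplied independently (hypothetically, one block
    per GPU), and the partial results are stacked back in row order.
    """
    n = len(a)
    # Row boundaries of each block, e.g. [0, 2, 4] for n=4, parts=2.
    bounds = [n * p // parts for p in range(parts + 1)]
    result = []
    for p in range(parts):
        block = a[bounds[p]:bounds[p + 1]]   # rows this "device" would own
        result.extend(matmul(block, b))      # stand-in for a per-GPU GEMM
    return result


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0, 2], [0, 1, 3]]
# The split version should agree with the single-pass product.
assert split_matmul(a, b) == matmul(a, b)
```

Splitting by rows means the merge step is just a concatenation, with no reduction across devices. The cost is that every device needs the whole right-hand matrix, which may not be acceptable for the largest inputs.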
Also, since CUDA 4 will work with MPI, will NVIDIA release any libraries for matrix operations on a CUDA cluster?