Scaling cuBLAS and cuFFT to Multiple GPUs: Developer Feedback

It seems like every new release of cuBLAS and cuFFT does a good job of improving performance over the previous revision, mainly because the code is further optimized for a single device.

We wanted to get a conversation going regarding the following idea:

As developers, would you be willing to sacrifice optimal single-device performance for somewhat lower per-device performance that can scale across multiple GPUs?

The MAGMA project aims to scale the functions commonly found in LAPACK across multiple GPUs, and I was curious to know how developers feel about this new direction.
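To make the trade-off concrete, here is a toy NumPy sketch of the kind of column-block partitioning a multi-GPU GEMM typically uses: each device computes one column slice of `C = A @ B` independently, and the slices are concatenated at the end. This is purely illustrative; the function name and the list-of-slices stand-in for per-device work are my own assumptions, and a real implementation would launch a cuBLAS GEMM on each GPU and pay transfer/synchronization costs that this sketch ignores.

```python
import numpy as np

def multi_gpu_gemm_sketch(A, B, num_devices=2):
    """Conceptual column-block split of C = A @ B across devices.

    Each simulated 'device' computes A @ B[:, lo:hi] for its own
    column slice; a real multi-GPU library would run one cuBLAS
    GEMM per device and gather the results.
    """
    n = B.shape[1]
    # Column boundaries, one slice per simulated device.
    bounds = np.linspace(0, n, num_devices + 1, dtype=int)
    parts = [A @ B[:, bounds[i]:bounds[i + 1]]
             for i in range(num_devices)]
    # Concatenating the per-device slices recovers the full product.
    return np.hstack(parts)

# The split result matches a single-device GEMM.
A = np.random.rand(4, 3)
B = np.random.rand(3, 6)
assert np.allclose(multi_gpu_gemm_sketch(A, B, num_devices=3), A @ B)
```

The point of the sketch is that the per-slice GEMMs are smaller and less cache/occupancy-optimal than one large single-device call, which is exactly the performance trade being asked about.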

**I have no affiliation with MAGMA, but we're curious about your thoughts since we are working on a similar, unrelated project.**