Tiled cuBLAS GEMM on multiple GPUs


I’m seeking an efficient implementation that uses streams and concurrency to perform matrix multiplication on arbitrarily large matrices across multiple GPUs, i.e., similar to what’s described in the webinar on streams and concurrency.
I would really appreciate it if somebody could provide such an implementation, or at least parts of it.

Kind regards Toke

Do you mean partitioning one matrix over multiple GPUs/streams?
If not, you can have different streams call cuBLAS by using cublasSetStream().
On sm_35 devices you could also use dynamic parallelism with cuBLAS today.
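For the cublasSetStream() route, here is a minimal sketch (not a tested implementation): it issues several independent SGEMMs on one GPU, each on its own stream, so work from different tiles can execute concurrently. The function name, the square-tile sizes, and the per-tile device-pointer arrays are all illustrative assumptions.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Illustrative sketch: launch nStreams independent n-by-n SGEMMs on one
// GPU, each bound to its own stream via cublasSetStream(). dA/dB/dC are
// assumed arrays of device pointers, one tile per stream (hypothetical).
void streamed_gemms(cublasHandle_t handle, int nStreams, int n,
                    float *const *dA, float *const *dB, float *const *dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    for (int s = 0; s < nStreams; ++s) {
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        // All subsequent cuBLAS calls on this handle run on `stream`.
        cublasSetStream(handle, stream);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA[s], n, dB[s], n, &beta, dC[s], n);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish
    // Stream destruction and error checking omitted for brevity.
}
```

Note that overlap of transfers and compute additionally requires pinned host memory and cudaMemcpyAsync, as the webinar describes.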



I mean one big matrix over multiple GPUs. Just like explained here: http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf
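One common way to spread a single large GEMM over several GPUs is to split C = A·B column-wise: each GPU gets a full copy of A plus a contiguous column block of B (contiguous because cuBLAS uses column-major storage) and computes the matching column block of C. The following is an untested sketch under those assumptions; the function name and the even column split are illustrative, and error checking and resource cleanup are omitted.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Illustrative sketch: C (m x n) = A (m x k) * B (k x n), column-major,
// split column-wise across nGpus devices. Each GPU g handles columns
// [j0, j0 + nb) of B and C. Hypothetical helper, not a tested routine.
void multi_gpu_sgemm(int nGpus, int m, int n, int k,
                     const float *hA, const float *hB, float *hC)
{
    const float alpha = 1.0f, beta = 0.0f;
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        int j0 = g * (n / nGpus);
        int nb = (g == nGpus - 1) ? n - j0 : n / nGpus;

        cublasHandle_t handle;
        cublasCreate(&handle);
        cudaStream_t stream;
        cudaStreamCreate(&stream);
        cublasSetStream(handle, stream);

        float *dA, *dB, *dC;
        cudaMalloc(&dA, sizeof(float) * m * k);
        cudaMalloc(&dB, sizeof(float) * (size_t)k * nb);
        cudaMalloc(&dC, sizeof(float) * (size_t)m * nb);

        // Copy A and this GPU's column block of B (contiguous in
        // column-major layout, starting at column j0).
        cudaMemcpyAsync(dA, hA, sizeof(float) * m * k,
                        cudaMemcpyHostToDevice, stream);
        cudaMemcpyAsync(dB, hB + (size_t)j0 * k,
                        sizeof(float) * (size_t)k * nb,
                        cudaMemcpyHostToDevice, stream);

        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, nb, k, &alpha, dA, m, dB, k, &beta, dC, m);

        cudaMemcpyAsync(hC + (size_t)j0 * m, dC,
                        sizeof(float) * (size_t)m * nb,
                        cudaMemcpyDeviceToHost, stream);
    }
    // Wait for every GPU; frees and cublasDestroy omitted for brevity.
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        cudaDeviceSynchronize();
    }
}
```

For the async copies to actually overlap with compute, hA/hB/hC should be pinned (cudaHostAlloc); if A is too large for one device, it would also need to be tiled along k, accumulating partial products with beta = 1.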