Hi!
I’m looking for an efficient implementation that uses streams and concurrency to multiply arbitrarily large matrices across multiple GPUs, similar to what’s described in the webinar on streams and concurrency.
I would really appreciate it if somebody could provide such an implementation, or at least parts of it.
Kind regards, Toke