Tips for improving SGEMV-type work?

I have 1000 separate matrix-vector operations I need to do as fast as possible. None of the data overlaps, and all of them can be done in parallel.

It looks like:

(200,200) x (200) = (200)
(200,200) x (200) = (200)
(200,200) x (200) = (200)

Do I just create 1000 streams and fire off an sgemv in each? Or is there a faster way to trigger the operations without huge kernel-launch overhead?

Note: all data stays on the GPU for further processing.

It seems you could build a (1000x200) matrix from the 1000 vectors, then use a single sgemm.

Unfortunately the 1000 200x200 matrices are also unique, so there's no chance to use a single sgemm.
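Since the matrices are unique but all the same size, one option is a batched GEMM: treat each 200-element vector as a 200x1 matrix and do all 1000 products in a single `cublasSgemmStridedBatched` launch. A minimal sketch, assuming the matrices and vectors are packed contiguously on the device (the names `n`, `batch`, and the packing layout are my assumptions, not from the thread):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 200, batch = 1000;
    float *dA, *dX, *dY;
    cudaMalloc(&dA, sizeof(float) * n * n * batch); // 1000 unique 200x200 matrices, back to back
    cudaMalloc(&dX, sizeof(float) * n * batch);     // 1000 input vectors
    cudaMalloc(&dY, sizeof(float) * n * batch);     // 1000 output vectors
    // ... fill dA and dX (in your case the data already lives on the GPU) ...

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // One launch performs all 1000 matvecs: each "GEMM" is (200x200) x (200x1).
    cublasSgemmStridedBatched(handle,
        CUBLAS_OP_N, CUBLAS_OP_N,
        n, 1, n,                   // m, n, k: the n=1 column makes each GEMM a matvec
        &alpha,
        dA, n, (long long)n * n,   // A_i, lda, stride between consecutive matrices
        dX, n, n,                  // x_i viewed as a 200x1 matrix
        &beta,
        dY, n, n,                  // y_i
        batch);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dX); cudaFree(dY);
    return 0;
}
```

This avoids 1000 separate kernel launches entirely; `cublasSgemmStridedBatched` requires CUDA 8.0 or later, and `cublasSgemmBatched` (with arrays of pointers) is the older alternative if the operands are not contiguous.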

So use streams to overlap host<->device memcpys and kernel executions.
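A sketch of the streams suggestion: round-robin the 1000 SGEMVs over a small pool of streams so that independent launches can overlap on the GPU. The pool size (`NSTREAMS`) and the contiguous pointer layout are assumptions for illustration:

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 200, batch = 1000, NSTREAMS = 8;
    float *dA, *dX, *dY;
    cudaMalloc(&dA, sizeof(float) * n * n * batch);
    cudaMalloc(&dX, sizeof(float) * n * batch);
    cudaMalloc(&dY, sizeof(float) * n * batch);

    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&streams[i]);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    for (int i = 0; i < batch; ++i) {
        // Route each matvec to a stream; small kernels in different
        // streams can execute concurrently if the GPU has spare SMs.
        cublasSetStream(handle, streams[i % NSTREAMS]);
        cublasSgemv(handle, CUBLAS_OP_N, n, n,
                    &alpha, dA + (size_t)i * n * n, n,
                    dX + (size_t)i * n, 1,
                    &beta, dY + (size_t)i * n, 1);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(streams[i]);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dX); cudaFree(dY);
    return 0;
}
```

Note this still pays 1000 launch overheads; it mostly helps hide that cost behind concurrent execution, whereas a batched GEMM call collapses everything into one launch.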

This post describes how to coalesce multiple SGEMMs; it involves streams and computation overlap.
If your hardware's parallelism is not fully exploited, there is some room to improve performance.

https://devblogs.nvidia.com/parallelforall/optimizing-recurrent-neural-networks-cudnn-5/

Source code is here:
https://github.com/parallel-forall/code-samples/tree/master/posts/rnn