Multiple small matrix-vector multiplications

I am looking for the most efficient way to compute the matrix-vector product for a large number of small matrix-vector pairs. I am familiar with the BLAS routines for matrix-vector multiplication. However, because I have many small matrices to multiply by small vectors, I am trying to avoid launching a kernel for each individual product. Is there a way to use BLAS to compute many matrix-vector products with one call? Alternatively, is there a reference out there describing the best way to organize this problem?
Thanks
Matt Bakalar

You can look at OPLib:

http://www.level3finance.com/oplib.html

The author, Claudio Albanese, proposes a BLAS4 operation, which is useful when the matrices and vectors are small.
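
Independent of any particular library, the usual way to organize this kind of problem is to pack all the matrices into one contiguous buffer and process the whole batch in a single call. Below is a minimal CPU-side sketch of that layout in C++. It is not OPLib's interface or a BLAS routine: the function name `batched_matvec`, the row-major packing, and the assumption that every matrix has the same dimensions are all choices made for this illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not a BLAS or OPLib routine): computes
// y[b] = A[b] * x[b] for `batch` independent pairs in one call.
// Layout assumption: every matrix is rows x cols, stored row-major,
// and packed back to back, so matrix b starts at A[b * rows * cols].
// Vectors are packed the same way: x[b] starts at x[b * cols].
void batched_matvec(const std::vector<float>& A,
                    const std::vector<float>& x,
                    std::vector<float>& y,
                    std::size_t batch, std::size_t rows, std::size_t cols)
{
    y.assign(batch * rows, 0.0f);
    for (std::size_t b = 0; b < batch; ++b) {
        const float* Ab = A.data() + b * rows * cols;  // matrix b
        const float* xb = x.data() + b * cols;         // input vector b
        float* yb = y.data() + b * rows;               // output vector b
        // On a GPU, each (b, r) pair could be handled by one thread
        // of a single kernel launch instead of these two outer loops.
        for (std::size_t r = 0; r < rows; ++r) {
            float sum = 0.0f;
            for (std::size_t c = 0; c < cols; ++c)
                sum += Ab[r * cols + c] * xb[c];
            yb[r] = sum;
        }
    }
}
```

With two 2x2 matrices packed as `{1,2,3,4, 5,6,7,8}` and vectors packed as `{1,1, 2,0}`, calling `batched_matvec(A, x, y, 2, 2, 2)` fills `y` with both results in one pass. Because the data for the entire batch is contiguous, the same indexing carries over directly to a single kernel launch on the GPU.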
