CUDA-LAPACK Availability/Field of Application

Let’s say you have 1000 small 9x9 matrices on which you want to perform
a singular value decomposition or some other LAPACK calculation.

Can we expect a significant performance gain if a proprietary CUDA-LAPACK comes out?
Is a CUDA-LAPACK planned, and if so, when will it come out?

If I am right, at the moment a speedup is only achieved for fairly
large matrices, because a normal LAPACK built on top of CUDA-BLAS has to
transfer each matrix to the GPU individually, whereas a CUDA-LAPACK could
transfer and process many small matrices at once. Is that correct?
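
To make the transfer argument concrete, here is a rough back-of-the-envelope sketch. It assumes each host-to-device copy costs a fixed latency plus a size-dependent part; the latency and bandwidth figures are made-up placeholders, not measurements, and the function name is hypothetical.

```python
# Toy cost model for host-to-device copies: every transfer pays a fixed
# latency plus bytes / bandwidth. All numbers are illustrative assumptions.

def transfer_time(num_transfers, bytes_per_transfer, latency_s, bandwidth_bytes_per_s):
    """Total time for num_transfers copies of bytes_per_transfer each."""
    per_copy = latency_s + bytes_per_transfer / bandwidth_bytes_per_s
    return num_transfers * per_copy

NUM_MATRICES = 1000
MATRIX_BYTES = 9 * 9 * 8   # one 9x9 matrix of 8-byte doubles
LATENCY_S = 10e-6          # assumed fixed cost per copy (placeholder)
BANDWIDTH = 5e9            # assumed bytes per second (placeholder)

# Case 1: each small matrix is sent to the GPU on its own.
one_by_one = transfer_time(NUM_MATRICES, MATRIX_BYTES, LATENCY_S, BANDWIDTH)

# Case 2: all matrices are packed into one buffer and sent in a single copy.
batched = transfer_time(1, NUM_MATRICES * MATRIX_BYTES, LATENCY_S, BANDWIDTH)

print(f"one by one: {one_by_one * 1e3:.3f} ms")
print(f"batched:    {batched * 1e3:.3f} ms")
```

In this model the payload term is the same in both cases, so the whole difference is the 999 extra fixed latencies paid by the one-by-one version. Whether real hardware behaves like this depends on the actual latency and bandwidth, and the model ignores kernel launch overhead and the computation itself.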