CUDA equivalent of the LAPACK dgeev call in batch mode?

I am trying to find a CUDA equivalent of the dgeev function from LAPACK.

I compiled magma-1.1.0 on a Tesla C2070 and ran its dgeev test, which benchmarks matrices of size 1024 to 8064. Interestingly, for the 1024x1024 matrix the GPU takes more time than the CPU:

   N     CPU Time(s)    GPU Time(s)    ||R||_F / ||A||_F
==========================================================
1024        31.66          51.06
2048       251.49         138.11
3072       515.84         322.13
4032       738.23         578.76
5184      1429.96         793.89
6016      1634.60        1136.89
7040      2171.73        1432.91
8064      3345.07        1625.88

What I actually want is to run dgeev on a 10x10 matrix 100,000 times (i.e. in burst mode).

In this scenario, each GPU thread would solve one 10x10 matrix. Launching 64 threads, for example, would then solve 64 10x10 matrices in parallel.
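To make the one-matrix-per-thread idea concrete, here is a minimal CPU-side sketch in C++ of the layout I have in mind. Everything in it is an assumption for illustration only: the function names (`dominant_eigenvalue`, `solve_batch`, `make_test_batch`), the contiguous row-major packing of the batch, and the use of power iteration as a simple stand-in for dgeev (it finds only one real dominant eigenvalue, not the full spectrum and eigenvectors). In a CUDA kernel, the loop over `m` would go away and `m` would be computed from the thread index instead.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

constexpr int N = 10;  // matrix dimension from the question

// Work done for ONE matrix (what a single GPU thread would do).
// Stand-in for dgeev: power iteration with a Rayleigh quotient. It assumes
// the dominant eigenvalue is real and strictly largest in magnitude.
double dominant_eigenvalue(const double* a, int iters) {
    double x[N], y[N];
    for (int i = 0; i < N; ++i) x[i] = 1.0;  // arbitrary start vector
    double lambda = 0.0;
    for (int k = 0; k < iters; ++k) {
        double xy = 0.0, xx = 0.0, yy = 0.0;
        for (int i = 0; i < N; ++i) {
            y[i] = 0.0;
            for (int j = 0; j < N; ++j) y[i] += a[i * N + j] * x[j];  // y = A x (row-major)
            xy += x[i] * y[i];
            xx += x[i] * x[i];
            yy += y[i] * y[i];
        }
        lambda = xy / xx;                  // Rayleigh quotient estimate
        const double norm = std::sqrt(yy);
        if (norm == 0.0) break;            // A x vanished; nothing more to iterate
        for (int i = 0; i < N; ++i) x[i] = y[i] / norm;
    }
    return lambda;
}

// Emulates "one thread per matrix" with a plain loop. Matrix m starts at
// offset m * N * N in the packed batch. In a CUDA kernel, m would be
// blockIdx.x * blockDim.x + threadIdx.x, guarded by a check m < count.
std::vector<double> solve_batch(const std::vector<double>& batch,
                                std::size_t count, int iters) {
    std::vector<double> out(count);
    for (std::size_t m = 0; m < count; ++m)
        out[m] = dominant_eigenvalue(&batch[m * N * N], iters);
    return out;
}

// Hypothetical test data: matrix m is upper bidiagonal with diagonal
// (m+1)*1, (m+1)*2, ..., (m+1)*N and ones on the superdiagonal, so its
// eigenvalues are the diagonal entries and the largest is (m+1)*N.
std::vector<double> make_test_batch(std::size_t count) {
    std::vector<double> batch(count * N * N, 0.0);
    for (std::size_t m = 0; m < count; ++m) {
        double* a = &batch[m * N * N];
        for (int i = 0; i < N; ++i) {
            a[i * N + i] = (m + 1.0) * (i + 1);
            if (i + 1 < N) a[i * N + i + 1] = 1.0;
        }
    }
    return batch;
}
```

Calling `solve_batch(make_test_batch(64), 64, 2000)` would correspond to the 64-thread case above. With this packing, the full set of 100,000 matrices takes 100,000 x 100 x 8 bytes = 80 MB, so storage is not the problem; what I am missing is a library routine that does the real dgeev work per matrix.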

Any suggestions for a CUDA library that can handle this?

PS: I have looked at CULA R12 and haven’t found anything on their forums that suggests a burst mode for small matrices.

Thanks in advance.