Multithreading and cuSOLVER execution time

I’m using a Tesla K40m and cuSOLVER to compute the eigenvalues and eigenvectors of a 16x16 matrix. The result is correct, and a single thread takes 3.5 ms. But when I create 32 threads, each of which creates its own CUDA stream for synchronization and calls the cuSOLVER function, the total time is 116 ms, while GPU usage is only 37%. Why does this happen? Can cuSOLVER not be used from multiple threads? How can I solve this?

The task is so small that most of those 3.5 ms is probably time spent in the driver (pushing the job to the GPU and receiving the answer). I don’t know cuSOLVER, but the only way I see to improve performance is to use a batched API, if one is available, or to call the cuSOLVER routines directly from your GPU code.