I’m using a Tesla K40m and cuSOLVER to compute the eigenvalues and eigenvectors of a 16x16 matrix. The result is correct, and a single thread takes 3.5 ms. But when I create 32 host threads, each creating its own CUDA stream for synchronization and calling the cuSOLVER function, the total time is 116 ms while GPU utilization is only 37%. Why does this happen? Can cuSOLVER not be used from multiple threads? How can I solve this?
The task is so small that most of that 3.5 ms is likely spent in the driver (pushing the job to the GPU and receiving the answer), not in the computation itself, so launching it from 32 threads just multiplies the per-call overhead. I don’t know cuSOLVER in detail, but the only real way to improve performance here is to use a batched API if one is available, or to call the solver routines directly from your GPU code.
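For what it’s worth, cuSOLVER does ship a batched eigensolver for small matrices: `cusolverDn<t>syevjBatched` (Jacobi method), which handles symmetric/Hermitian matrices with n ≤ 32, so 16x16 fits. A rough sketch of how all 32 problems could be solved in a single call is below; it assumes your matrices are symmetric, stored contiguously in column-major order in `d_A`, and it trims error checking for brevity:

```cpp
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Solve batchSize symmetric 16x16 eigenproblems in one call.
// d_A: device buffer of batchSize matrices (column-major, 16*16 each);
//      eigenvectors overwrite d_A on output.
// d_W: device buffer for batchSize * 16 eigenvalues.
void eig_batched(float *d_A, float *d_W, int batchSize) {
    const int n = 16, lda = 16;

    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    syevjInfo_t params;                 // Jacobi solver parameters
    cusolverDnCreateSyevjInfo(&params);

    int lwork = 0;
    cusolverDnSsyevjBatched_bufferSize(handle, CUSOLVER_EIG_MODE_VECTOR,
        CUBLAS_FILL_MODE_LOWER, n, d_A, lda, d_W, &lwork, params, batchSize);

    float *d_work = nullptr;
    int   *d_info = nullptr;            // per-matrix convergence status
    cudaMalloc(&d_work, sizeof(float) * lwork);
    cudaMalloc(&d_info, sizeof(int) * batchSize);

    // One launch covers the whole batch -- one trip through the driver
    // instead of 32.
    cusolverDnSsyevjBatched(handle, CUSOLVER_EIG_MODE_VECTOR,
        CUBLAS_FILL_MODE_LOWER, n, d_A, lda, d_W,
        d_work, lwork, d_info, params, batchSize);

    cudaFree(d_work);
    cudaFree(d_info);
    cusolverDnDestroySyevjInfo(params);
    cusolverDnDestroy(handle);
}
```

If your matrices are not symmetric, this routine doesn’t apply and you’d need a different approach, but for symmetric input it replaces 32 independent launches (and their driver overhead) with one.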