The Dsyevd example (`cusolverDnDsyevd`) from the cuSOLVER documentation seems to take over the entire GPU for its process. How would one go about parallelizing it across several matrices, or at least dedicating a certain number of blocks to it?

I have the following data pipeline:

1. 1D array (matrix struct) ->
2. multiply by its transpose to get a symmetric matrix ->
3. take eigenvalues and eigenvectors using the Dsyevd example ->
4. do more matrix things

So far I have steps 1, 2, and 4 parallelized as `__global__` functions launched from inside a main function, but I can only call step 3 once. Is there a way to do this, or would it make more sense to use another approach or library to distribute the load for, say, 100 large-matrix eigen-computations?
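
For concreteness, here is a rough sketch of the kind of thing I have in mind for step 3: one cuSOLVER handle per CUDA stream, with the matrices distributed round-robin over the streams. The helper name `syevd_on_streams`, the `num_streams` parameter, and the packed layout (all `n x n` column-major matrices back to back in one device buffer) are my own assumptions and not taken from the documentation example. I have not measured whether the solves actually overlap; for large matrices a single solve may already saturate the GPU.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cusolverDn.h>

// Hypothetical helper (not from the cuSOLVER samples).
// d_A: `batch` symmetric n x n matrices, column-major, stored back to back on
//      the device; overwritten with the eigenvectors.
// d_W: batch * n doubles on the device; receives the eigenvalues.
// Returns 0 if every call succeeded and every solve reported info == 0.
int syevd_on_streams(double* d_A, double* d_W, int n, int batch, int num_streams)
{
    const cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR;
    const cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER;
    std::vector<cudaStream_t> streams(num_streams, nullptr);
    std::vector<cusolverDnHandle_t> handles(num_streams, nullptr);
    std::vector<double*> work(num_streams, nullptr);
    std::vector<int> h_info(batch, -1);
    int* d_info = nullptr;  // one status flag per matrix
    int lwork = 0;
    bool ok = (num_streams > 0) &&
              (cudaMalloc(&d_info, sizeof(int) * batch) == cudaSuccess);

    // One handle per stream, so work queued through different handles
    // is allowed to run concurrently.
    for (int s = 0; s < num_streams; ++s) {
        ok = ok && (cudaStreamCreate(&streams[s]) == cudaSuccess);
        ok = ok && (cusolverDnCreate(&handles[s]) == CUSOLVER_STATUS_SUCCESS);
        ok = ok && (cusolverDnSetStream(handles[s], streams[s]) == CUSOLVER_STATUS_SUCCESS);
    }

    // All matrices share n, so one workspace query is enough.
    ok = ok && (cusolverDnDsyevd_bufferSize(handles[0], jobz, uplo, n, d_A, n,
                                            d_W, &lwork) == CUSOLVER_STATUS_SUCCESS);
    // One workspace per stream: solves on the same stream run in order,
    // so they can safely reuse it.
    for (int s = 0; s < num_streams; ++s)
        ok = ok && (cudaMalloc(&work[s], sizeof(double) * lwork) == cudaSuccess);

    // Queue matrix i on stream i % num_streams.
    for (int i = 0; i < batch; ++i) {
        const int s = i % num_streams;
        ok = ok && (cusolverDnDsyevd(handles[s], jobz, uplo, n,
                                     d_A + (size_t)i * n * n, n,
                                     d_W + (size_t)i * n,
                                     work[s], lwork, d_info + i) == CUSOLVER_STATUS_SUCCESS);
    }

    // Wait for all streams, then check the per-matrix status flags.
    ok = ok && (cudaDeviceSynchronize() == cudaSuccess);
    ok = ok && (cudaMemcpy(h_info.data(), d_info, sizeof(int) * batch,
                           cudaMemcpyDeviceToHost) == cudaSuccess);
    for (int i = 0; ok && i < batch; ++i)
        ok = (h_info[i] == 0);

    for (int s = 0; s < num_streams; ++s) {
        if (work[s]) cudaFree(work[s]);
        if (handles[s]) cusolverDnDestroy(handles[s]);
        if (streams[s]) cudaStreamDestroy(streams[s]);
    }
    if (d_info) cudaFree(d_info);
    return ok ? 0 : 1;
}
```

This would be compiled with something like `nvcc file.cu -lcusolver` and called once after step 2 has written all symmetric matrices into `d_A`. cuSOLVER also has a batched Jacobi solver, `cusolverDnDsyevjBatched`, but as far as I understand it is aimed at many small matrices, so I am not sure it fits my case.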